Reducing  the  Cost  of  System  Administration  of  a  Disk  Storage 
System  Built  from  Commodity  Components 


Satoshi  Asami 


Report  No.  UCB/CSD-00-1100 

May  2000 

Computer  Science  Division  (EECS) 
University  of  California 
Berkeley,  California  94720 


Distribution statement: Approved for public release; distribution unlimited


Reducing  the  Cost  of  System  Administration  of  a  Disk  Storage  System  Built  from 

Commodity  Components 


by 

Satoshi  Asami 

B.S.  (University  of  Tokyo)  1989 
M.S.  (University  of  Tokyo)  1991 

A  dissertation  submitted  in  partial  satisfaction  of  the 
requirements  for  the  degree  of 
Doctor  of  Philosophy 

in 

Computer  Science 
in  the 

GRADUATE  DIVISION 
of  the 

UNIVERSITY of CALIFORNIA at BERKELEY

Committee  in  charge: 

Professor  David  A.  Patterson,  Chair 
Professor  Robert  Wilensky 
Professor  J.  George  Shanthikumar 

2000 


The  dissertation  of  Satoshi  Asami  is  approved: 


Chair  Date 


Date 


Date 


University  of  California  at  Berkeley 


2000 


Reducing  the  Cost  of  System  Administration  of  a  Disk  Storage  System  Built  from 

Commodity  Components 


Copyright  2000 
by 

Satoshi  Asami 




Abstract 


Reducing  the  Cost  of  System  Administration  of  a  Disk  Storage  System  Built  from 

Commodity  Components 

by 

Satoshi  Asami 

Doctor  of  Philosophy  in  Computer  Science 
University  of  California  at  Berkeley 
Professor  David  A.  Patterson,  Chair 

This dissertation explores how to reduce the system administration cost of disk storage
systems. There are several reasons why reducing the operator’s burden is the key to the success of
large storage systems. One is that the cost of system administration usually dominates the budget
of storage systems. Another is that an operator error on a storage system can easily have disastrous
results. Studies in physiology and psychology have shown that reducing mental and physical
stress on the operator is crucial in preventing human errors.

This dissertation describes Tertiary Disk, a large-scale disk array system built from commodity
components, and how we evaluated the feasibility of its design. Instead of incurring the cost
of custom hardware, we attempt to solve various problems by design and software. Tertiary Disk
is a cluster of storage nodes connected by switched Ethernet. Each storage node is a PC hosting a
few dozen SCSI disks, running the FreeBSD operating system. The system is used as a web-based
image server for the Zoom Project in cooperation with the Fine Arts Museums of San Francisco.
Our system is fully redundant in both hardware and software, and is designed to avoid a single point
of failure.

There are several approaches to lowering the human cost of system administration. One is
to make the system as autonomous as possible. I have designed a self-maintenance extension to
the operating system to make the system run continuously in the event of failures. There are also
several other improvements to the system to make the operator’s job easier.

Finally, we demonstrate the feasibility of our system by evaluating it by simulation. Failure
data collected on Tertiary Disk over the course of several years was used to design
an event generator. A second program, a simulator, models the system using a directed acyclic
graph and computes its availability by solving a connectivity problem. The results show that
our system performs as expected with the current set of parameters, and also expands nicely into
the future.


Professor  David  A.  Patterson 
Dissertation  Committee  Chair 


Dedication 


To  my  parents — I  wouldn’t  be  here  without  you. 




Contents

List of Figures

List of Tables

1 Introduction
  1.1 Have Computers Really Become Cheaper?
    1.1.1 Building Large Storage Systems
    1.1.2 System Administration Cost
    1.1.3 Our Approach
    1.1.4 Applicability
  1.2 Outline of Dissertation
  1.3 Related Work
    1.3.1 Disk Storage Systems
    1.3.2 Fast Recovery
    1.3.3 Monitoring and Diagnosis
    1.3.4 Self-Management of Storage
  1.4 Research Issues/Contributions

2 Motivation
  2.1 Overview
    2.1.1 Backups
    2.1.2 Redundancy
  2.2 The Impact of System Administration
    2.2.1 Comparison with Hardware Cost
    2.2.2 The Risk Factor
    2.2.3 Availability
  2.3 Our Approach
  2.4 Summary

3 The System
  3.1 The Application
    3.1.1 Problems
    3.1.2 GridPix
    3.1.3 Status
  3.2 Tertiary Disk Architecture
    3.2.1 Commodity Components
    3.2.2 Redundancy
    3.2.3 Topology
    3.2.4 Disk Interface
  3.3 Software Architecture
    3.3.1 Handling User Requests
    3.3.2 Operating System Support
  3.4 Summary

4 The Cost of System Administration
  4.1 Self-Maintaining System
    4.1.1 Definition of Self-Maintaining System
    4.1.2 Requirements
  4.2 A System That Is Easy To Maintain
    4.2.1 Disk Identification
    4.2.2 Taking Advantage of Redundancy
    4.2.3 Modular Upgrading
  4.3 Summary

5 Failures
  5.1 Collecting Data
    5.1.1 System Log
    5.1.2 HTTP Server Log
    5.1.3 Repair Records
  5.2 Failure Summary
  5.3 Hardware Error Details
    5.3.1 PC
    5.3.2 Network
    5.3.3 The SCSI Subsystem
  5.4 Software Errors
    5.4.1 Operating System
    5.4.2 Application
  5.5 Summary

6 Validation
  6.1 Simulation Setup
    6.1.1 The Generator
    6.1.2 The Simulator
  6.2 Alternatives to Simulation
  6.3 Simulation Results and Exploration
    6.3.1 Overview
    6.3.2 Failure Rates
    6.3.3 Operator Cost
    6.3.4 Repair Interval
    6.3.5 Disk Striping
    6.3.6 Mirroring
    6.3.7 RAID-5
    6.3.8 Double-Ending
    6.3.9 Operating System Crashes
    6.3.10 Number of PCs per Disk
    6.3.11 Total Size
    6.3.12 99.999% or 99.9999% Availability
  6.4 Summary

7 Conclusion
  7.1 Results
  7.2 Contributions
  7.3 What We Would Have Done Differently
    7.3.1 Disk Interface
    7.3.2 Disk Purchases
    7.3.3 Disk Enclosures
    7.3.4 PC Cases
  7.4 Future Work
    7.4.1 Completion
    7.4.2 Design Parameters

Bibliography




List of Figures

3.1 Three-layer GridPix image
3.2 GridPix interface
3.3 Tertiary Disk Final Prototype
3.4 Double-Ending of SCSI disks
3.5 Double-Ending with feed-through terminators
3.6 Power connection and mirroring topology on a system with constant termination
3.7 Power connection and mirroring topology on a system without constant termination
3.8 Ethernet connection and mirroring topology
3.9 Handling user requests
3.10 Frontend switching protocol

5.1 Sample system log
5.2 SCSI disk read/write errors

6.1 A simple connectivity graph
6.2 Ethernet cables
6.3 Sample generator output
6.4 OR-node
6.5 AND-node
6.6 Data, datasets and striped sets
6.7 Power supplies
6.8 Redundant power supplies
6.9 SCSI drive failure rates and availability
6.10 SCSI drive failure rates and MTBF
6.11 IDE drive failure rates
6.12 UPS failure rates
6.13 Ethernet switch failure rates
6.14 SCSI adapter failure rates
6.15 Effect of component failure rates
6.16 Operator time
6.17 Visits that result in repair
6.18 Repair interval and availability
6.19 Repair interval and MTBF
6.20 Fixed Interval and Pager Arrangements
6.21 Stripe set sizes
6.22 Disk failure rates and mirroring
6.23 SCSI drive failure rates and double-ending
6.24 IDE drive failure rates and double-ending
6.25 Size of striped sets and double-ending
6.26 Repair interval and double-ending
6.27 Mirroring and double-ending
6.28 Operating system crashes and availability
6.29 Operating system crashes and MTBF
6.30 Changing number of PCs
6.31 Increasing the total system size
6.32 Reducing the repair interval
6.33 Reducing the number of disks per striped set




List of Tables

1.1 Comparison of a $2,000 PC in 1991 and 1999

3.1 Frontend switching effects
3.2 Reboot steps and timings

5.1 Log durations
5.2 Frequencies of failures

6.1 Simulator nodes
6.2 Generator general options
6.3 Generator component options
6.4 Connectivity factors
6.5 Simulator parameters




Acknowledgements 

During my years in graduate school many people have helped me professionally and personally.
Foremost among those people is my advisor, David Patterson. He has taught me to look at the big
picture and work hard. He has also taught me the value of patience and exercised it in guiding me
through the program. I am extremely lucky to have had such an understanding and inspirational
advisor.

I  would  also  like  to  thank  the  other  readers  of  this  thesis,  Robert  Wilensky  and  J.  George 
Shanthikumar,  for  their  detailed  and  thoughtful  comments.  Robert  also  helped  me  find  a  new  loving 
home for the images, so I can leave the university with peace of mind. George’s lectures were
always  theoretically  sound  while  also  down  to  the  details,  which  I  enjoyed  immensely. 

The  other  professors  involved  with  the  Network  of  Workstations  (NOW)  projects  deserve 
my  special  thanks:  Tom  Anderson  and  David  Culler.  They  have  skillfully  guided  the  toddling 
Tertiary  Disk  project  in  its  infancy.  Jim  Gray  of  Microsoft  was  the  instigator  of  the  Tertiary  Disk 
project.  The  project  would  not  have  existed  without  him. 

The people at the Fine Arts Museums of San Francisco gave us a wonderful collection of
pictures  to  store  in  our  (initially  empty)  storage  system.  As  they  say  in  Japan,  a  statue  is  worthless 
without  a  heart.  I  would  especially  like  to  thank  Bob  Futernick,  Dakin  Hart  and  Sue  Grinols  for 
entrusting  us  with  their  precious  pictures  and  helping  us  understand  what  it  takes  to  handle  fine 
art  objects.  Cal  students  Victor  Wong,  Tony  Le,  Wen-Kai  Zhong,  Gabriela  Hernandez,  Nicholas 
Huynh,  Lila  Tretikov,  Abby  Thompson  and  Shankara  Lowe  helped  us  prepare  the  images  for  public 
viewing. 

I  would  also  like  to  thank  the  many  professors  who  have  made  my  time  here  at  Berkeley 
enjoyable with great classes. Randy Katz’s VLSI design course was the best class I have taken
at any level. Jim Pitman and Max Mendel helped me understand the basics and applications of
statistics,  with  many  outrageous  examples  which  I  will  never  forget.  Richard  Newton  opened  my 
eyes  to  finer  points  of  computer  design.  Larry  Rowe  taught  me,  among  other  things,  that  a  good 
user  interface  is  just  as  important  as  the  internals  of  the  system  itself.  Eugene  Lawler  showed  me 
the  value  of  theoretical  thinking.  I  wish  he  could  be  at  the  commencement  to  share  the  moment  with 
us. 

Terry  Lessard-Smith,  Bob  Miller,  Kathryn  Crabtree  and  Peggy  Lau  also  deserve  special 
thanks for straightening out the many problems I had. Eric Fraser was very helpful in taking care of
the  machines  during  the  project’s  early  days,  while  Kevin  Mullally  was  the  one  who  first  taught  me 
the  importance  of  system  administration. 

The people in the FreeBSD project were very helpful in providing fixes and workarounds
to  the  problems  we  encountered.  Justin  Gibbs  and  Ken  Merry  deserve  special  thanks  for  their  work 
on  the  SCSI  system,  Stefan  Esser  for  his  PCI  bridging  support,  and  Bruce  Evans  for  his  guidance 
through  the  kernel  interface  and  build  process. 

I  enjoyed  sharing  477  Soda  with  a  lot  of  people,  including  Nisha  Talagala  and  Remzi 
Arpaci-Dusseau.  Nisha  has  also  been  my  project  partner  and  was  a  great  help  in,  among  other 
wonderful  things,  breaking  many  disks.  Remzi  supplied  the  often  dry  room  with  a  good  sense  of 
humor  as  well  as  a  constant  stream  of  soft  drinks.  Eric  Anderson,  like  a  good  neighbor,  was  always 
around  when  I  needed  a  little  expert  advice. 

My friends May Cheng, Ritsuko Nakamura and Matt O’Hara helped me get through difficult
times. I don’t know where I would be without them. My parents, Fumi and Susumu Asami,
have been very supportive throughout my entire life and have always believed in me. They have
given their only son an opportunity to pursue his dreams in a foreign land. I am happy to be able
to show them that I can finish what I started and hope that I have brought honor to the family name
for them.

Last  but  not  least,  my  wife  Margaret  Asami  has  been  my  best  supporter  in  my  last  few 
years.  No  matter  how  busy  she  is  at  work,  she  always  takes  the  time  to  listen,  encourage,  cheer  me 
up  and  clean  up  after  dinner,  while  enduring  the  many  long  nights  I  had  to  go  through.  I  simply 
can’t  thank  her  enough. 

This  research  was  funded  by  DARPA  Roboline  grant  N00600-93-K-2481  and  the  State  of 
California  MICRO  program.  I  would  also  like  to  thank  IBM  and  Intel  for  donating  disks  and  PCs, 
respectively,  to  help  build  the  Tertiary  Disk  cluster. 




Chapter  1 

Introduction 


Ten years ago, large disk storage systems were built as specialized, large-iron server systems.
They combined the fastest central processing units (CPUs) available on workstations with
special buses and expensive memory systems. Server price tags were much higher than those of the
average workstation. People were still content, since all computers were expensive and they needed
only one of these servers for dozens of workstations.

Since then, prices of computer parts have been dropping rapidly, especially in the past
few years. For instance, the cost of disk per unit capacity, usually measured in dollars or cents per
megabyte, has been steadily dropping at the rate of 60-100% per year since 1991. Decreases in
the cost of semiconductor memory, CPUs, and other parts are similarly significant.

1.1  Have  Computers  Really  Become  Cheaper? 

However, the prices of computers haven’t really gone down: we now simply have access
to much better components for the same price. Ten years ago, you could get a decent computer for
$1,000, and you can still get a decent computer for $1,000 today; the only thing that changed is the
meaning of the word “decent”. Table 1.1 shows a comparison of a better-than-average PC that
you could purchase for $2,000 in late 1991 and in late 1999. The 1991 PC was the first PC I purchased
when I came to the United States. There are some PCs now that sell for much less than $2,000, but
they are not sufficient as building blocks for storage systems, so I decided not to list them here.

To summarize, we can purchase a PC that has 32 times as much main memory, 85 times
as much disk space and 128 times as much video memory for about the same price as we did
eight years ago. The monitor is now capable of redrawing more than twice as many scan lines per
second. Comparing the speed of computers is harder, as there are many factors that affect the real
performance. For instance, you can’t just say that the 600MHz Pentium III in 1999 is 18 times faster
than the 33MHz 80386 in 1991 just by comparing the clock rates, as they have different registers,
instruction sets, cycles per instruction, and so on [HP96a]. However, from the numbers listed in
Table 1.1 we can see, for instance, that the main system bus is capable of transferring eight times
as much data per second and the disk sequential transfer speed has improved by a factor of 30. The
improvement in speed of individual components appears to be by a factor of around 10 to 30.
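
The ratios quoted above can be checked directly against Table 1.1. The short Python sketch below is only an illustration: the dictionary layout and function names are made up for this example, and the bus bandwidth figure assumes one full-width transfer per clock cycle, which is an idealization rather than something the table states.

```python
# Values copied from Table 1.1 as (1991, 1999) pairs.
# The dictionary layout and the names below are illustrative only.
SPECS = {
    "main memory (MB)": (4, 128),
    "disk capacity (GB)": (0.12, 10.2),
    "video memory (MB)": (0.25, 32),
    "horizontal sync (kHz)": (36, 82),
    "CPU clock (MHz)": (33, 600),
    "disk sequential speed (MB/s)": (0.3, 10),
    "disk interface speed (MB/s)": (3, 160),
}


def bus_bandwidth_mb_s(clock_mhz, width_bits):
    """Peak bandwidth, assuming one full-width transfer per clock cycle."""
    return clock_mhz * width_bits / 8


def improvement(old, new):
    """How many times larger the 1999 value is than the 1991 value."""
    return new / old


ratios = {name: improvement(old, new) for name, (old, new) in SPECS.items()}

isa = bus_bandwidth_mb_s(8.3, 16)  # 1991 system bus (ISA)
pci = bus_bandwidth_mb_s(33, 32)   # 1999 system bus (PCI)
ratios["system bus bandwidth (MB/s)"] = improvement(isa, pci)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Under these assumptions the computed ratios agree with the rounded figures in the text: 32x for main memory, 85x for disk capacity, 128x for video memory, and roughly 8x for the system bus.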

It is also notable that the disk subsystem has improved significantly. The individual disks
have increased in size by about two orders of magnitude, and the data transfer rate of the bus has
increased by a factor of 50. It is also now possible to attach more disks to a single bus.

Subsystem        Name                       1991       1999
CPU              Class                      80386      Pentium III
                 Internal clock rate        33MHz      600MHz
                 External clock rate        33MHz      100MHz
                 External bus width         32 bits    64 bits
Cache            On-chip                    (none)     32KB
                 Off-chip                   64KB       512KB
Main memory      Type                       DRAM       SDRAM
                 Size                       4MB        128MB
System bus       Type                       ISA        PCI
                 Clock rate                 8.3MHz     33MHz
                 Width                      16 bits    32 bits
Disk interface   Type                       IDE        SCSI
                 Bus width                  8 bits     16 bits
                 Max. transfer speed        3MB/s      160MB/s
                 Devices per bus            2          15
Hard disk        Size                       0.12GB     10.2GB
                 Sequential access speed    0.3MB/s    10MB/s
                 Average seek time          15ms       7ms
Video card       Video RAM size             0.25MB     32MB
Monitor          Screen size                14 inch    17 inch
                 Max. horizontal sync       36kHz      82kHz

Table 1.1: Comparison of a $2,000 PC in 1991 and 1999

1.1.1  Building  Large  Storage  Systems 

Given this new situation, it appears it is now feasible to build very large disk storage
systems using commodity hardware. The PCI (Peripheral Component Interconnect) bus can transfer
up to 132MB/s in burst mode and can be chained by PCI expansion boxes to an arbitrary depth to
connect dozens of expansion cards to a single CPU [Sha95]. The SCSI (Small Computer System
Interface) bus can connect up to 15 devices to a single host adapter and transfer 160MB/s total
[X3T]. Combining this with the expandability of the PCI bus makes it possible to connect hundreds
of disks to a single PC without using any specialized equipment. Although still behind in
floating-point operations, today’s PCs have enough integer processing power to put to shame server systems
of a decade ago. The Tertiary Disk project, which I will explain in detail in Chapter 3, has built one
such system entirely from commodity components [TAAP98].
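
As a rough, back-of-the-envelope illustration of these figures, the Python sketch below estimates how many SCSI host adapters a given number of disks would need and how many streaming disks it would take to saturate each bus. The bus limits come from the paragraph above and the per-disk rate from Table 1.1; the function names and the assumption that every disk streams sequentially at its full rate are hypothetical simplifications, not part of the Tertiary Disk design.

```python
import math

PCI_BURST_MB_S = 132        # PCI burst rate quoted in the text
SCSI_BUS_MB_S = 160         # SCSI bus rate quoted in the text
SCSI_DEVICES_PER_BUS = 15   # devices per host adapter quoted in the text
DISK_SEQ_MB_S = 10          # sequential speed of the 1999 disk in Table 1.1


def adapters_needed(num_disks, per_bus=SCSI_DEVICES_PER_BUS):
    """Minimum number of SCSI host adapters needed to attach num_disks."""
    return math.ceil(num_disks / per_bus)


def disks_to_saturate(bus_mb_s, disk_mb_s=DISK_SEQ_MB_S):
    """Number of disks, all streaming at disk_mb_s, that fill the bus."""
    return math.ceil(bus_mb_s / disk_mb_s)


print("adapters for 200 disks:", adapters_needed(200))
print("disks to saturate one SCSI bus:", disks_to_saturate(SCSI_BUS_MB_S))
print("disks to saturate one PCI bus:", disks_to_saturate(PCI_BURST_MB_S))
```

The point of the sketch is that attaching hundreds of disks is a matter of adding adapters, while aggregate sequential bandwidth is bounded by the shared buses well before that many disks are connected.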

1.1.2  System  Administration  Cost 

On the other hand, the cost of administering storage systems has remained high over the
years. For instance, a 1993 study by Strategic Research Corp. shows that the annual cost of system
administration of storage systems is almost three times the cost of the system hardware itself
[SR93]. Without significantly reducing the cost of system administration, any benefit of using
commodity hardware to build storage systems will quickly diminish, especially if we are to build a
system with a large number of disks to take advantage of the new interfaces, instead of purchasing
an expensive turnkey solution like a hardware RAID box.

1.1.3  Our  Approach 

Our solution to this challenge is to make the storage system self-maintaining [ATP99].
Instead of having someone constantly look after the system, possibly with an around-the-clock
pager as seen on some installations, our system is designed to mask problems and run continuously
even when there are several independent component failures. The system administrator’s job is only
to come in for scheduled visits, for instance, once per week, to replace the failed components. Such
a mode of operation will also reduce the possibility of human errors, as people working during normal
hours are less likely to commit errors than those who are paged and have to drive to work in the
middle of the night. Indeed, Gray found that operator errors are as plentiful as hardware errors, so
this scheme will improve availability as well [Gra90]. We also propose another possibility, with
an operator who carries a pager but has no obligation to come in immediately: if the page arrives
after normal work hours, that person can just show up the next morning at 9 A.M. We will show that,
with our design, such a mode of operation will result in very good system availability, without adding
significantly to the cost.

The main application of our system is a web server holding image data for objects of art.
In that context, we have decided to adopt a policy called “repair by reload”. For any single failure,
our goal is to mask the failure within five seconds so that a retry by the client will succeed. We
picked five seconds because it is short enough that, if we can mask the problem within that
period, users will get the correct web page when they get impatient and hit the
reload button on their browsers. Because of the nature of the Internet, such short-term failures
are likely to be viewed as the transient network problems that users experience every day, rather
than a problem on our end. It is virtually impossible for users to distinguish between intermittent
network problems and very short server outages.

In our application, end users will not issue any writes, which also simplifies our task. Our
approach is to mirror all the data that is required for day-to-day operation. The design
issue then simply becomes how to quickly detect failures and reroute user requests to the backup
copy.
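
To make the idea of rerouting to a mirrored copy concrete, here is a minimal Python sketch. It is not the frontend switching mechanism used by Tertiary Disk, which is described in Chapter 3; the names `ReplicaDown`, `fetch_with_failover`, `primary` and `mirror` are hypothetical, and the only value taken from the text is the five-second masking goal.

```python
import time

MASK_DEADLINE_S = 5.0  # masking goal stated in the text


class ReplicaDown(Exception):
    """Raised by a (hypothetical) storage node that cannot serve a request."""


def fetch_with_failover(image_id, replicas, clock=time.monotonic):
    """Try each mirrored copy in order until one succeeds or the deadline passes.

    `replicas` is a list of callables that take an image id and return bytes.
    """
    start = clock()
    last_error = None
    for fetch in replicas:
        if clock() - start > MASK_DEADLINE_S:
            break  # too late to mask the failure from an impatient user
        try:
            return fetch(image_id)
        except ReplicaDown as err:
            last_error = err  # remember the failure and move on to the mirror
    raise ReplicaDown(f"no copy of {image_id} reachable") from last_error


def primary(image_id):
    # Stand-in for a failed storage node.
    raise ReplicaDown("primary node failed")


def mirror(image_id):
    # Stand-in for the node holding the backup copy.
    return f"image data for {image_id}".encode()


print(fetch_with_failover("img-001", [primary, mirror]))
```

Because the data is read-only from the end user's point of view, either copy can serve the request without any consistency protocol, which is what keeps the rerouting step this simple.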

In addition to making the system self-maintaining, I have explored several other methods
to reduce the cost of system administration. For instance, one important aspect of a self-maintaining
system is that it can reboot quickly: the longer it takes to reboot, the larger the window of
vulnerability and the smaller the chance that it will continue running without human intervention. By combining
several techniques, I was able to reduce the boot time from 22 minutes to about 1 minute.

1.1.4  Applicability 

Although some aspects of our design benefit from the read-mostly nature of the application
and our position as a web server, we are hardly alone in providing such a service. There are
literally hundreds of large web sites on the Internet that specialize in read-mostly information
content, as well as more traditional anonymous ftp servers and gopher servers. To extend the horizon a
little, the read-mostly nature is also common in many other types of large storage systems, such as
certain kinds of database servers.


1.2  Outline  of  Dissertation 

This section outlines the thesis and lists its overall results by chapter. The rest of this
chapter lists the related work in the field. Chapter 2 expands on this chapter and illustrates the
motivation behind the research.

The architecture of the system we are using for the experiment and the main application
is illustrated in Chapter 3. The research was conducted on a cluster of PCs built by the Tertiary
Disk group [TAAP98]. The system consists of 24 PCs and 364 8GB disks for a total raw capacity of
3.2TB. Four of the machines are various frontend and infrastructure servers; the disks are connected
to the other 20. The disk servers are in two different configurations: the “disk-heavy” set with 35
disks per machine, and a “disk-light” set that has 16 disks per machine.

All the machines run the FreeBSD operating system [FB93]. I have been working with the
developers of the FreeBSD project who have been helpful in fixing bugs and making enhancements
to the system to make it suitable for use as a large-scale storage system. We explain and evaluate
the applicability of such improvements to other systems in Chapter 3. They are also compared to
similar methods proposed elsewhere.

Our main application is an art image database server. Cooperating with the Fine Arts
Museums of San Francisco, we have been offering pictures of objects of art through their search
engine [TAP+98]. Our site holds large versions of the images that do not fit on their disk.

In Chapter 4, we clarify the concept of self-maintainability and explain how we approach
the problem. The word “self-maintaining” has different meanings depending on the point of view
of the two groups that interact with storage systems: users and system administrators. Users do not
know what is going on inside the system; they just expect the system to be “up” at all times and will
get easily annoyed if it isn’t. Depending on the application, there are various types of users, with
different requirements. For a public service site such as ours, it is essential to keep user annoyance
to a minimum as unhappy users may never come back.

The system administrators are generally more understanding of the system’s problems than
end users are (some of them are even the architects of the systems themselves), but the amount of
time these people have to spend directly affects the cost of keeping the system up and running.
Also, the more pressure is put on an administrator, the more likely human errors become, and
human errors by system administrators can have disastrous results. Thus, it is very important to
ease the job of the system administrator through whatever means possible.

We validated our system using a simulator. There were three steps in our validation
methodology: collecting data, running the simulation, and evaluating the results. The first step is
explained in Chapter 5, while the other two are detailed in Chapter 6. The data we needed to collect
fell into two categories. One is failure information: the symptoms, error frequencies, and so on. The
other is repair information, i.e., how long it takes to conduct a repair on a certain component.

After the data were analyzed, they were fed to an event-driven simulator to predict the behavior
of the system over a longer period of time. The simulator builds a simple directed acyclic graph
(DAG) to represent the system and checks the connectivity whenever the state of one of the
components changes. We also varied some parameters to explore the design space. The parameters
include numerical ones, such as repair interval and degree of redundancy, and qualitative ones,
such as the importance of double-ending. The results were compared with data available for other
systems to prove the validity of our design.
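
The connectivity check described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the simulator used in this work: the component names, the EDGES table, and the helper functions reachable and run are all hypothetical, and failure and repair times are supplied by hand rather than drawn from measured distributions.

```python
from collections import deque

# Hypothetical topology: an edge points from a component to the components
# that depend on it (power supply -> host -> SCSI bus -> disk).
EDGES = {
    "root": ["power0", "power1"],
    "power0": ["host0"],
    "power1": ["host1"],
    "host0": ["bus0"],
    "host1": ["bus0"],  # double-ended bus: reachable from either host
    "bus0": ["disk0", "disk1"],
}


def reachable(edges, failed, start="root"):
    """Return the set of working components reachable from start."""
    if start in failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def run(events, edges, targets):
    """Process (time, component, state) events in time order and record
    which targets are still reachable after every state change."""
    failed = set()
    history = []
    for time, component, state in sorted(events):
        if state == "fail":
            failed.add(component)
        else:  # any other state is treated as a repair
            failed.discard(component)
        up = reachable(edges, failed)
        history.append((time, sorted(t for t in targets if t in up)))
    return history
```

With the double-ended bus in this toy topology, the disks stay reachable after one host fails and become unreachable only when both hosts are down.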

Chapter 7 summarizes the results and discusses future prospects of the research topic. This
chapter also offers insight into what we would have done differently if we were to build a similar
system again.

1.3  Related  Work 

I have listed related work in the field of fast recovery, system administration of clustered
systems, in particular monitoring and diagnosis, and large disk storage systems.

1.3.1  Disk  Storage  Systems 

There are three classes of disk storage systems that are related to my research: hardware
RAID, high-availability server systems and network-attached storage systems.

Hardware  RAID 

RAID (Redundant Array of Inexpensive Disks) is a method of building large disk storage
systems from smaller disks [CGP+88, Gib92]. There are six levels of RAID, 0 through 5,
differing in redundancy schemes. Most widely used are RAID-0, which is simple striping with no
redundancy, RAID-1, which is mirroring, and RAID-5, which is block-interleaved parity without a
dedicated parity disk. There are many companies selling RAID-based systems on the market: EMC,
Sony, IBM, BoxHill, StorageTek, Sun, and DEC, to name a few. There are more than 50 companies
and 100 products in the RAID Consortium, and collectively they ship billions of dollars of
systems each year.

The problems with current RAID systems are twofold. One is price, in that the custom
hardware required to build these boxes costs much more than commodity components such as those
we are using. The other is expandability. These boxes have a fixed upper limit on the number of
disks, usually topping out at several dozen to a few hundred gigabytes, and will force the user to
purchase another box when the limit is reached. However, increasing the number of RAID boxes opens
up a whole new set of problems, as the reliability decreases linearly with the number of boxes while
the maintenance cost and complexity increase.

As  mentioned  throughout  this  thesis,  these  are  the  problems  we  are  trying  to  solve  with 
our  system. 

High-Availability  Servers 

Many vendors offer complete server systems with a large storage capacity. Tandem has a
long history in this field. Their NonStop servers are fully redundant and have hot-swappability of
most components, so it is not necessary to power them down even during repairs [BBC+90]. Their
architecture, called “shared nothing”, has no single point of failure, i.e., any one component can
fail and the system will still function. Jim Gray did a study on their availability in the late 1980s
[Gra90].

Their software uses a design called “process pairs” in which a primary process, doing all
the work during normal operation, periodically sends synchronization messages to a backup process
[Bar81, Tan97a]. When the primary process, or the hardware that supports it, fails, the backup
process will take over. Since the backup process does not have to do any heavy processing,
there is very little overhead during normal operation. This design is similar to the way our frontend
servers back up each other on a peer-to-peer basis.

Their system administration suite is called TMDS (Tandem Maintenance and Diagnostic
System) [Tan97b]. It has an auto-diagnostics feature that uses an artificial intelligence program
based on past knowledge, which compares the symptoms of the system to known failure modes and
tries to identify the failures. TMDS can optionally dial up Tandem’s technical support center with a
report which Tandem technicians can consult while running remote diagnostics of their own.

Network-Attached  Storage 

Network  Appliance  [NA92]  sells  network  servers  built  around  Digital’s  Alpha  chip.  Their 
NFS  servers  are  advertised  to  outperform  a  4-way  P6-200MHz  Windows  NT  system  by  2  to  10 
times.  They  use  RAID-4,  block-interleaved  parity  with  a  dedicated  parity  disk,  with  NVRAMs  to 
increase  performance.  They  have  a  filesystem  that  can  recover  from  crashes  very  quickly  by  using 
checkpointing  and  roll-forward  logs. 

Microsoft Tiger is a video server built from commodity PCs which they call “cubs”
[BBD+96, BFD97]. Their goal is to tolerate the failure of any one cub or disk without noticeable
degradation of service. They use mirroring for backing up data, since they cannot tolerate the
runtime reconstruction overhead of parity-based redundancy schemes. They place the “primary”
copies of the data on the outer tracks of the disks for better performance, and the backup copies
are declustered to avoid loading a single machine or disk during failures. Tiger distributes all files
across all disks on all cubs for maximum striping performance, but this method has the drawback of
having to reconstruct data on the entire system when a new disk is added. This reconstruction is not
very expensive as there is no parity calculation involved, but the system has to be shut down for a
few hours while the reconstruction takes place.

A single controller, which has the only IP address known from outside of the cluster,
serves as the entry point for clients. No image data passes through the controller, so it is unlikely
to become a performance bottleneck. This design, in which a machine with a single IP address
known to the outside world serves as a frontend that does not handle data, is similar to our system,
except that we have two frontend machines backing up each other for extra availability.

They take great care to ensure that sufficient bandwidth is available during the entire
course of the playback, and will delay the start of playback if necessary. They use a distributed
schedule management protocol for scalability. Each cub has a partial, potentially outdated view of
the schedule, which the cubs pass around, updating it along the way. Cubs use a deadman protocol
for fault detection: each cub sends a periodic ping to the cub on its right.
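
The ring-style deadman check can be illustrated with a short sketch. The text only states that each cub periodically pings the cub on its right; the timeout rule, the last_ping list, and the function name detect_failures below are assumptions made for illustration and are not taken from the Tiger papers.

```python
def detect_failures(last_ping, now, timeout):
    """Ring deadman check (hypothetical sketch).

    last_ping[i] is the time at which cub i last pinged its right-hand
    neighbour, cub (i + 1) % n. A cub whose ping is overdue is suspected
    by that neighbour. Returns {suspected_cub: cub_that_noticed}.
    """
    n = len(last_ping)
    suspects = {}
    for sender in range(n):
        if now - last_ping[sender] > timeout:
            watcher = (sender + 1) % n  # the cub on the sender's right
            suspects[sender] = watcher
    return suspects
```

Because every cub is watched by exactly one neighbour, the monitoring load per cub stays constant as the ring grows, which fits the scalability goal mentioned above.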

Petal is another network-attached storage system [LT96]. Petal is a collection of distributed
servers, each containing multiple disks. They use a method called chained declustering
to avoid having the load increase 100% on a machine when its mirrored counterpart fails [HD89].
Petal is a block server. The authors of Petal have also designed a distributed file system called
Frangipani to run on top of Petal [TML97].
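
To see why chained declustering avoids doubling the load on one machine, consider the following sketch. It assumes the usual layout in which the backup copy of each disk's primary data lives on the next disk in the chain, and it assumes reads can be split freely between a primary and its backup; Petal's actual implementation may differ. The function name chained_load is hypothetical.

```python
from fractions import Fraction


def chained_load(n, failed):
    """Balanced read load after one failure under chained declustering.

    Assumed layout: disk i holds primary fragment i and the backup of
    fragment (i - 1) % n. Returns {disk: (primary_share, backup_share)},
    where each share is the fraction of a fragment's reads served there.
    """
    loads = {}
    for j in range(1, n):
        disk = (failed + j) % n
        primary = Fraction(j, n - 1)     # share of its own fragment
        backup = Fraction(n - j, n - 1)  # share of its predecessor's fragment
        loads[disk] = (primary, backup)
    return loads
```

With four disks and disk 0 failed, every surviving disk ends up serving 4/3 of its normal load, instead of one disk serving twice its normal load as with plain mirroring.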

CMU’s Network Attached Secure Disks (NASD) is another example of network-attached
storage [GNA+96, GNA+98]. Their disks have more intelligence than current disks, and their focus
is on security and safety of data.

A final example is xFS, a combination of network striping and a Log-structured Filesystem
(LFS [SBMS85]), which is a method of reducing the disk latency on random writes [ADN+95].

Our system avoids the complexity of distributed filesystems by using HTTP redirects and
IP masquerading to distribute user requests and mask failures. In addition to Tiger, this approach
is also similar to the initial method employed by the Inktomi project at UC Berkeley, which was a
large search engine built from a small cluster of Sun workstations.

There have been many studies that express complex systems using a graph. For example,
Ren and Dugan proposed a method to dynamically trim fault trees to analyze system behavior
efficiently [RD98].

1.3.2  Fast  Recovery 

Mary Baker has studied fast reboot of systems after a crash for her Ph.D. thesis [Bak93].
Most of her work is about distributed state recovery of the Sprite operating system, but she has also
done a considerable amount of work to streamline the normal UNIX start-up procedure to improve
the reboot time. Her thesis is that a system that recovers very quickly can be as available as a system
that crashes less frequently but recovers more slowly. I have made similar improvements to our boot
process, as can be found in Section 3.3.2.
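
The trade-off between crash frequency and recovery time can be made concrete with the standard steady-state availability formula, MTTF / (MTTF + MTTR). The sketch below is only an illustration; the function name and the numbers in the usage note are hypothetical and are not taken from [Bak93].

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)
```

For example, a system that crashes every 1000 hours and needs 1 hour to recover has the same availability as one that crashes every 100 hours but recovers in 6 minutes (0.1 hour).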


1.3.3  Monitoring  and  Diagnosis 


Eric Anderson of the Network of Workstations (NOW [ACPtNT95]) Project has been
studying several aspects of system administration of clusters [And97]. His most recent work
introduces a system called CARD (Cluster Administration using Relational Databases) [AP97]. He
proposes using relational databases to build an extensible monitoring system. CARD uses a hybrid
push-pull protocol to collect data from individual machines. His emphasis is on monitoring and
diagnosis.

Swatch [HA93] is a push system, in which each machine sends a periodic update to a
centralized server. Swatch modifies UNIX programs to emit more detailed logs, especially those
pertaining to security information. It then uses syslog to collect the information and send it to the
server. It provides a collection of Perl scripts that filter the logs to make them human-readable.
Swatch also allows administrators to define their own functions using scripts to focus on some parts
more carefully.

Tkined [Sch95] is another centralized system. Tkined has an extensive collection of methods
for gathering data. Because it is distributed with complete source code, it can be extended by
modifying the program. Since the data is not accessible outside of the Tkined program, new modules
either have to be added to Tkined, or have to repeat the data gathering. Tkined provides simple
support for visualization and does not aggregate data before displaying it. Tkined’s centralized pull
model limits its scalability.

Pulsar [Fin97] uses pulse monitors (small Tcl/Tk [Ous94] scripts) to collect information
about systems. If the measured value is not within the predefined bounds, a report is sent to a central
server. Pulsar is extensible but not fault tolerant due to its centralized design. We are using Tcl/Tk
and its extension Expect for some of our system administration tools.

Nisha Talagala of the Tertiary Disk Project has studied the failures we encountered on our
system as part of her Ph.D. thesis [Tal99]. She has concentrated on six months of the logs from the
system and has analyzed them in detail. I have done a similar analysis in Chapter 5, covering a longer
period of time and with a different emphasis, as a basis of my simulation.

SCSI-3 has a new feature called SCSI Enclosure Services (SES). It can be used to
monitor information such as the status of the enclosure power supply or fan and the temperature of
various components, as well as to send commands to control devices such as enclosure buzzers and
LEDs. It would have been useful for our project if it had been available when we built the prototype.

1.3.4  Self-Management  of  Storage

Most of the literature that deals with reducing the cost of system administration of storage
accomplishes the goal by automated management of storage devices. Although the emphasis of
this thesis is not on that aspect of system administration, some of the most prominent projects are
mentioned here, as we share the common goal of reducing the system administrator’s burden.

The Storage Systems Program (SSP) at HP is working on a system called “self-management”
of storage [BGM+96]. Their focus is on automatic assignment of storage devices; humans do not
have to worry about where to put what. Their prototype system is capable of assigning several
thousand objects to devices in a few minutes.

Sun’s Jini makes devices, including disks, identify themselves as part of their automatic
component discovery paradigm [Wal98]. The devices either have enough intelligence built into them
to be able to identify themselves to the clients, or they will download the drivers from a server.
There is also a recent project at Sun called Jiro, which defines a storage management interface
to let components manufactured by different vendors cooperate to provide an easily manageable
environment. Their draft is currently undergoing a review every two months or so [Jir99].

Microsoft’s Windows 95/98 have an “autorun” feature on CD-ROM drives, in which a
special file is checked for when a CD is inserted and is run automatically when it is found. This
mechanism is similar to our “script” approach of setting up disks mentioned in Section 3.3.2.

1.4  Research  Issues/Contributions 

To summarize, these are the contributions of this research:

•  Analyzing and evaluating various methods to reduce the system administration cost of large
storage systems

•  Designing and implementing a self-maintaining storage system

•  Several small but helpful enhancements to the operating system to make the system easier to
manage

•  A guide to designing systems that recover quickly

•  Cataloging and analyzing the component failures we have seen over the two years the prototype
has been in operation

•  An evaluation of the reliability of our design by simulation and exploration of the design
space

•  Insights into the relationship between system availability and the reliability of a single disk

•  Insights into the relationship between system availability and the frequency of visits for repair

•  Use of a DAG to represent the system to simplify simulation of availability

•  A list of concerns when building large-scale storage systems from commodity components


Chapter  2 

Motivation 


This  chapter  explains  further  why  reducing  the  burden  on  the  system  administrator  is 
important  for  storage  systems.  The  point  we  would  like  to  emphasize  is  that  the  system  will  be 
cheaper,  safer,  and  more  available  if  it  can  mostly  manage  itself  rather  than  rely  on  being  watched 
by  a  human  operator. 

2.1  Overview 

Most storage systems today are ad hoc collections of components, growing over the years
as the organization’s needs increase. The two most common methods for protection against data
loss are backups and redundancy. I will briefly review them here, focusing on the system
administrator’s point of view. The types of failures they protect the system from are also described.

2.1.1  Backups 

Backups  are  typically  done  to  magnetic  tapes.  Tapes  have  very  low  unit  cost,  moderate 
bandwidth  and  slow  random  access.  These  characteristics  make  them  ideal  for  backups  since  the 
backup  tapes  are  usually  written  in  one  pass  and  almost  never  read  back. 

The reasons for doing backups are twofold. One is to safeguard against hardware
failures that result in data loss. The other is to help recover data from human errors.

Hardware  Failures 

If the storage system suffers a crash, the operator can recover the filesystem from backup
tapes. In some organizations, the range of failures to safeguard against also includes
catastrophic disasters: the only way to prevent losing all data when there is a fire in the computer
room is to make a copy of the data and send it to a physically remote location. The cheapest way
to accomplish this kind of redundancy is to ship full backup tapes to another site periodically.

The system administrator usually does not have to spend much time making backups after
the initial setup, although the amount of daily tape-changing that has to be done depends on
the characteristics of the tape backup system. For instance, a large organization with a tape
library can have fully automated daily backups; a smaller one with a stacker might require an
operator to change tapes daily.


Human  Errors 

Since  few  filesystems  in  existence  today  offer  versioning  support,  even  if  the  storage  system 
functions  exactly  as  designed  without  any  failures,  users  can  still  “lose”  files  because  of  their 
own  mistakes.  By  having  a  backup  done  periodically,  the  system  administrator  can  manually 
recover  at  least  some  version  of  the  lost  file — if  backups  are  done  daily,  it  will  probably  be 
yesterday’s  version. 

Recovering from human errors could be a much more time-consuming process for the system
administrator, as the user might not even know when the file was last written. Given information
such as “I think I deleted a file named handler.c in this directory last week”, the
operator has to go through several tapes trying to find the latest version of the file in question.
The situation is further complicated when the system has incremental backups, since yesterday’s
version of the file is not necessarily found in yesterday’s daily backup tapes; it could
have been written months ago and therefore only show up in the last full backup tape set.

Again, the amount of time the operator has to spend depends on the system. One with a tape
library and good backup management software might make answering such a request a
one-line command; others might require the operator to manually go back and forth between
the terminal and the tape unit and issue several commands to find the right version of the files.

Note that even with backups, some data loss is inevitable. In particular, there is no way
to get back modifications done since the last backup. Also, backups have no effect on system
availability in the event of a hardware failure, except that they will help restart the system more
quickly. Basically, the system will be down until the operator comes in and restores the filesystem.

2.1.2  Redundancy 

There are many ways to build disk storage systems with some degree of redundancy to
prevent data loss when hardware failures occur. The most common approach, called RAID
(Redundant Array of Inexpensive Disks), is a method of building large disk storage systems from
smaller disks. There are six levels of RAID, 0 through 5, differing in redundancy schemes. Here are
brief descriptions of the most commonly used RAID levels.

RAID-0 

Level 0 is simple striping. Obviously, this is not “redundant” at all; it is included in RAID
levels just as a reference point. RAID-0 is optimal for non-essential data where read/write
speed is most important, as this suffers from no overhead due to redundancy-based protection
schemes.

RAID-1 

Level 1, or mirroring, takes data on one set of disks and makes identical copies of it on
another set. RAID-1 can tolerate multiple failures as long as one of the disks is alive in every
mirror pair. RAID-1 has a 100% cost penalty, as the system can only hold half the data it
would without redundancy. There is also a penalty on writes, since all data blocks have to be
written on two disks instead of one.

RAID-5 

Level 5, block-interleaved parity without a dedicated parity disk, offers a good balance of
performance and cost. With N disks in a group, the cost penalty is (1/N × 100)%. Writing
to RAID-5 stripes can be very slow if it involves random accesses, since each write has to be
done as two reads (original data, original parity) and two writes (new data, new parity). The
performance suffers greatly when one of the disks fails, as many operations will now require
the system to access (N − 1) disks on an N-disk group. Recovery is time-consuming, as the
contents of all (N − 1) disks have to be read to reconstruct the failed disk.
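
The two-reads-plus-two-writes pattern and the (N − 1)-disk reconstruction can be illustrated with XOR parity. This is a minimal sketch assuming plain byte-wise XOR parity over equal-sized blocks; the helper names xor_blocks and small_write_parity are hypothetical, and real RAID implementations add striping, caching and error handling.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (full-stripe parity or rebuild)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def small_write_parity(old_data, old_parity, new_data):
    """Parity update for a small write.

    The caller has read the original data and the original parity (two
    reads); the result is the new parity, which is written together with
    the new data (two writes).
    """
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))
```

A failed block is rebuilt by calling xor_blocks on the surviving data blocks and the parity block, which is why every remaining disk in the group has to be read during recovery.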

In addition to these traditional RAID levels, there are redundancy schemes that combine more than
one RAID level, such as RAID 0+1, and those that use a more complex algorithm to protect against
multiple disk losses in a single redundancy group, such as P+Q RAID.

There  are  many  manufacturers  selling  RAID-based  systems  on  the  market.  There  are 
more  than  50  companies  and  100  products  in  the  RAID  Consortium,  and  collectively  they  ship 
billions  of  dollars  of  systems  each  year. 

In addition to hardware RAID, some systems implement RAID in software. They are
inherently slower than hardware RAID but require no extra hardware and therefore are much cheaper.
Many operating systems have software RAID support, either as a standard feature or an add-on
module, sometimes supplied by a third-party vendor.

In summary, RAID will safeguard against certain types of hardware failures, in particular,
a certain number of disk failures, but not others. It does not provide protection against human
errors in the way backups do. RAID systems have good availability as long as the failures are within
their design specs.

Administering RAID systems is generally easier, but human operators still need to take
steps to help the system recover. Depending on the design of the system and user interface,
managing spare disks can be a time-consuming task.

2.2  The  Impact  of  System  Administration 

There are several reasons why system administration is the key to the success of a large
storage system. One is that the cost of system administration usually dominates the budget of storage
systems. Without cutting down the system administration cost, it will not be possible to reduce the
overall cost of storage systems significantly. Another is that bad or nonexistent system
administration for storage systems will negatively affect the organization in many ways.

2.2.1  Comparison  with  Hardware  Cost 

The cost of computer hardware has been dropping steadily for the past few decades. However,
the cost of administering storage systems has remained high, as growth in demand for storage
has exceeded the capacity increase of a single disk, making the systems more complex than before.
For instance, as mentioned earlier, a 1993 study by Strategic Research Corp. shows that the average
annual cost of system administration of storage systems is almost three times the cost of the
system hardware itself [SR93]. Amdahl’s law suggests that, without significantly reducing the cost
of system administration, any benefit of using commodity hardware to build storage systems will
all but disappear [HP96b]. For instance, if the system administration cost is three times the
hardware cost, halving the hardware price will only save 12.5% of the overall cost; even if the
hardware price can be reduced to zero, the savings will only be 25% of the overall cost.
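
The arithmetic behind the 12.5% and 25% figures can be checked with a few lines of code. The sketch below normalizes the hardware cost to 1; the function name overall_saving and its parameters are hypothetical and introduced only for this illustration.

```python
def overall_saving(admin_to_hw_ratio, hw_price_reduction):
    """Fraction of total cost saved when the hardware price drops.

    admin_to_hw_ratio: administration cost divided by hardware cost.
    hw_price_reduction: fraction by which the hardware price is cut (0 to 1).
    """
    hardware = 1.0  # normalized hardware cost
    administration = admin_to_hw_ratio * hardware
    return hardware * hw_price_reduction / (hardware + administration)
```

With a ratio of 3, overall_saving(3, 0.5) gives 0.125 and overall_saving(3, 1.0) gives 0.25, matching the figures in the text.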


Note that their study is already targeting fairly expensive storage systems, as most
organizations surveyed were using large fileservers or hardware RAID boxes. For low-cost storage
systems such as ours, the ratio of system administration cost to hardware cost is even higher.

2.2.2  The  Risk  Factor 

The importance of system administration transcends that of the salaries of the system
administrators, as the system can do only so much to safeguard against operator errors. In general,
it is very easy for an operator to make a mistake that will cause a lot of data to be lost or the system
to be unavailable for an extended period of time. The “routine” part of the job of an operator,
the everyday work that doesn’t involve repairs and other non-standard operations, is generally not
very dangerous: it is possible to safeguard against typical mistakes made during those times fairly
effectively by normal means such as backups. However, the “emergency” work that needs to be
done by system administrators has a much higher risk than normal operations for several reasons.

Vulnerability 

The system is already in a vulnerable state when it is being repaired. For instance, consider a
RAID-5 volume with a failed disk. If the operator replaces one of the good disks instead of
the faulty one and then runs the reconstruction script, the entire dataset could be lost. Another
example is recovering a system from backup tapes. If the operator writes to the tapes instead
of reading from them, all data on them will be lost irrecoverably.

These may seem like unlikely events, but stranger things have happened in real life. For instance,
in our fully mirrored system, I managed to lose an entire dataset by replacing the wrong disk
during a repair and then running the copy in the wrong direction. Moreover, the point is not
how likely it is for these events to happen; it is that the system is in a vulnerable state,
and a simple error by the human operator can cause a lot more damage when the system is
down to one last copy of good data.

Familiarity 

System administrators don’t get to practice this out-of-routine work because of the
rare nature of the problems. Operators either follow written manuals or improvise along the
way, and both of these methods are prone to mistakes if the person is not familiar with the
actions.

Repair  Hours 

A system administrator who has just been called in by a pager in the middle of the night is
more likely to make mistakes than one who conducts repairs only during regular work hours.
This lack of sleep was one of the factors that contributed to the egregious disk copying mistake
I mentioned earlier.

There has been much research in the fields of physiology and psychology on the subject
of sleep deprivation and human performance. In Mullaney et al. [MKF83] and Angus and
Heslegrave [AH85], researchers confirm our intuition that people are less likely to perform
tasks correctly and efficiently after a long period of work and also during normal sleeping
hours.


Hockey et al. [HWS98] conducted a series of tests in which operators were asked to monitor
the state of a complex machinery control system and make necessary adjustments to keep the
system functional. Besides the expected result that operators without enough sleep
did not perform as well, they found an interesting tendency: sleep-deprived subjects
tended to exhibit reactive behavior rather than follow a preventive, model-based strategy in
dealing with problems. To quote a sentence from their conclusion, “Operators make more
frequent interventions in order to stabilize the system when faults occur, sometimes without
a clear idea of what is wrong.” It is worth pointing out that this kind of careless reflexive
action is very dangerous on a storage system in a vulnerable state, and can easily lead to data
loss.

In another study, Fairclough and Graham [FG99] found that sleep deprivation is similar to
alcohol consumption in terms of having a detrimental effect on subjects’ driving ability,
although one notable difference was that sober but sleep-deprived subjects knew that their
performance was suffering.

Mental  Pressure 

The operator will be under mental pressure to repair a system if it is already down and it has
to be brought up as soon as possible. Under such mental stress, people are more likely to
make mistakes they would not make without pressure.

Psychologists have been studying the effects of the presence of other people on performance
for over a century. The earliest experiment on this topic was done by Triplett in 1898.
Studies in the first half of the 20th century offered conflicting results as to whether the
presence of others helps or hurts performance. In 1965, Zajonc attempted to classify the
experiments to explain the differences. His conclusion was that simple, repetitive and
well-practiced tasks are helped by the presence of other humans, while more complex tasks
that require thinking and experimenting are inhibited: the person was found to be more
prone to errors when there were others observing the task [Zaj66].

Since Zajonc’s paper, there have been many attempts to verify or disprove his arguments, or
to offer a different explanation for the disparity; for a summary, see Paulus’ book [Pau89]. They
all agree that complex tasks are disturbed by the presence of others. We can deduce from
these results that the knowledge that users are waiting for the system
to be repaired can cause the operator to make mistakes that he or she would not otherwise make.

The problem illustrated here is that the system administrator is more likely to make errors
exactly when they can least be afforded, which is obviously not a desirable situation. We think
the self-maintaining system will improve the situation, and thereby improve availability.

2.2.3  Availability 

A system that relies on a human operator to function continuously will not be as available
as one that maintains itself. A system without enough redundancy, even one whose operator carries
an around-the-clock pager and can be called in during the middle of the night, cannot be expected
to recover in a few minutes, let alone in a few seconds like our design. It usually takes much longer
for human operators to repair problems than it does for machines [GS91].




2.3  Our  Approach 

I am proposing a “self-maintaining” approach to system administration, in which the
storage system maintains itself with only minimal help from a human operator. Instead of having
someone constantly on call to look after the system, our system is designed to mask problems and run
continuously even when there are several independent component failures. The system
administrator’s job is only to come in for scheduled visits, for instance once per week, to replace
the failed components.

As administration accounts for most of the cost of running a large server, such a reduction will also
save a great deal of money. In 1999, a system administrator’s salary was around $100,000 per year,
with benefits adding another $20,000 per year. Suppose an organization hires such a
person to administer a cluster of computers. If this person can work part-time, coming to
work only once per week, the organization can reduce the $120,000 total to 1/5, saving $96,000 per year.

If the system administration cost is three times the storage hardware cost, as was
found in Strategic Research’s study, and we can reduce the system administration cost to 1/5, then
we can calculate the overall storage system cost reduction to be 60% using the following
equations:


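\begin{align*}
\mathit{Cost}_{\text{maintenance}} &= \mathit{Cost}_{\text{hardware}} \times 3 \\
\mathit{Cost}_{\text{original}} &= \mathit{Cost}_{\text{hardware}} + \mathit{Cost}_{\text{maintenance}} \\
 &= \mathit{Cost}_{\text{hardware}} \times 4 \\
\mathit{Cost}_{\text{self-maintaining}} &= \mathit{Cost}_{\text{hardware}} + \mathit{Cost}_{\text{maintenance}} / 5 \\
 &= \mathit{Cost}_{\text{hardware}} + \mathit{Cost}_{\text{hardware}} \times 3/5 \\
 &= \mathit{Cost}_{\text{hardware}} \times 8/5 \\
\frac{\mathit{Cost}_{\text{self-maintaining}}}{\mathit{Cost}_{\text{original}}} &= \frac{\mathit{Cost}_{\text{hardware}} \times 8/5}{\mathit{Cost}_{\text{hardware}} \times 4} = 2/5
\end{align*}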
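As a rough check of this calculation, the following Python sketch computes the same ratio for any administration-to-hardware multiple and any remaining fraction of administration cost. The function name and its parameters are illustrative assumptions, not part of the original analysis; hardware cost is normalized to 1.

```python
from fractions import Fraction


def cost_ratio(maintenance_multiple, remaining_fraction):
    """Return Cost_self-maintaining / Cost_original.

    maintenance_multiple: administration cost as a multiple of hardware cost.
    remaining_fraction: fraction of the administration cost still paid.
    """
    hardware = Fraction(1)  # normalize hardware cost to 1
    maintenance = hardware * maintenance_multiple
    original = hardware + maintenance
    self_maintaining = hardware + maintenance * remaining_fraction
    return self_maintaining / original


ratio = cost_ratio(3, Fraction(1, 5))
print(ratio)      # 2/5
print(1 - ratio)  # 3/5, i.e. a 60% reduction
```

With a smaller administration multiple the saving shrinks accordingly, so the 60% figure depends directly on the factor of three taken from the cited study.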


Our  system  accomplishes  this  goal  of  self-maintenance  by  a  combination  of  redundancy 
and  design.  The  next  chapter  describes  the  system  in  detail. 

2.4  Summary 

Reducing the burden on the system administrator is important for storage systems. The
system will be cheaper, safer, and more available if it can manage itself rather than rely on being
watched by a human operator. I am proposing a “self-maintaining” approach to system
administration to attain that goal, as well as reducing the burden on the human operator by other means.




Chapter  3 

The  System 


This  chapter  illustrates  the  system  on  which  my  research  was  conducted.  The  application, 
as  well  as  the  hardware  and  software  architecture  of  the  system,  are  described  here. 

3.1  The  Application 

The main application for the Tertiary Disk prototype is an image database holding pictures
of objects of art [TAP+98]. The “Thinker” site (http://www.thinker.org/), run by the Fine
Arts Museums of San Francisco, has been providing access to art images through the Internet since
October 1996. They implemented a searchable index of their art objects, through which the user can
query their database of over 70,000 images by keywords such as artist name, title and description.
The user is presented with a page of thumbnails of images that fit the search criteria, and by clicking
on a thumbnail can view a larger version of the image. The largest images they provide are
about 500 pixels on one side, many of them converted from color to gray-scale JPEG to save space.
The original files were much larger, up to 3,096 x 2,048 pixels, but due to disk space constraints
and the lack of a way to adequately present them to users over the Internet, they decided to provide
only relatively small images.

3.1.1  Problems 

Understandably,  one  of  the  most  common  complaints  from  their  users  was  “can’t  you 
make  the  pictures  bigger?”  We  provide  disk  space  for  those  larger  versions  of  the  images.  There 
were  several  reasons  why  it  was  not  possible  for  the  museum  to  provide  larger  versions  of  the  images 
initially. 

Disk Space  The full-size version of a color JPEG averages a little less than 1MB per image.
According to advertisements in an August 1996 edition of MicroTimes, the largest 3.5-inch
drives available on the market at that time were around 4.3GB and their prices ranged from
$800 to $1,000. To hold 70,000 1MB JPEG images, they would have had to build a system
with 20 of these drives, which would have cost at least $16,000 for the raw disks alone. The people
at the museum had neither the budget nor the expertise to maintain a storage system of that scale.
The TIFF versions, which have higher quality than the JPEGs, average about 12MB per image.
To hold TIFF versions of all images, the space requirement would balloon to 840GB.




User’s Network Connection  Many people connect to the Internet using modems. A standard
connection using a 56kbps modem can transfer at most 4KB per second. It takes over four
minutes to download a 1MB JPEG and almost an hour for a 12MB TIFF.

Memory Usage  JPEG is a compressed image format; when uncompressed for display, a JPEG image
expands to about the same size as a TIFF image. In other words, there may be only 1MB
transferred over the network link, but every full-size image the user displays requires 12MB
of local memory just for raw image data. A few of these images will quickly expand the size
of the browser to 100MB or more, making it unacceptably slow.

Screen Size  There is no computer screen large enough to display an image of 3,000 by 2,000
pixels, so it is not feasible to just dump the entire image to the user’s web browser even if the
user has enough memory and a fast network connection.
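The disk-space and download-time figures in the list above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check under stated assumptions: it uses decimal units (1GB = 1,000MB), the 4KB per second modem rate quoted above, and hypothetical function names.

```python
import math

IMAGES = 70_000      # images in the museum's database
DRIVE_GB = 4.3       # largest 3.5-inch drive in August 1996
MODEM_KB_PER_S = 4   # effective modem transfer rate quoted in the text


def total_gb(mb_per_image, images=IMAGES):
    """Total space in GB, using decimal units (assumption)."""
    return images * mb_per_image / 1000


def drives_needed(gb, drive_gb=DRIVE_GB):
    """Bare minimum number of drives, with no spare capacity."""
    return math.ceil(gb / drive_gb)


def download_minutes(mb, kb_per_s=MODEM_KB_PER_S):
    """Minutes needed to transfer one image over the modem link."""
    return mb * 1000 / kb_per_s / 60


print(total_gb(1), drives_needed(total_gb(1)))  # JPEG: 70 GB, 17 drives minimum
print(total_gb(12))                             # TIFF: 840 GB
print(download_minutes(1))                      # a little over 4 minutes
print(download_minutes(12))                     # about 50 minutes
```

The bare minimum comes out at 17 drives; the figure of 20 drives given above presumably leaves some headroom for filesystem overhead and growth.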

3.1.2  GridPix 

To solve the problems mentioned in the previous section, I wrote a web-based image
viewing system called “GridPix”. GridPix uses tiled, layered JPEG images with multiple resolution
levels and simple HTML to achieve a “zooming” effect [Asa99].

The  GridPix  File  Format 

Figure 3.1 shows a three-layer GridPix file. Tiles 1 and 2 form the smallest layer, tiles
3 through 8 form the middle layer, and the largest layer consists of tiles 9 through 23.

Figure 3.2 shows the interaction between the client and the server. When the client
requests an image, a CGI script (mkhtml.cgi) returns an HTML page describing the page layout.
The individual image tiles are retrieved by a separate CGI script (gettile.cgi). GridPix is
completely HTML-based, so any graphical web browser can function as a client.

One important characteristic of the GridPix server is that it requires very few processing
cycles at runtime. The images are already divided into tiles by the TIFF-to-GridPix
converter, so each of the numerous calls to gettile.cgi requires only two accesses to the file: one to get the
offset and size of the tile from the header, and one to retrieve the tile itself. In particular, no
JPEG encoding or decoding is required on the server side. Moreover, the GridPix files are very
compact, so after a few accesses, most of the requests will be handled by the server’s in-memory disk
cache, not the disk surface.
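The two-access tile lookup can be illustrated with a small Python sketch. The on-disk layout used here (a tile count followed by one offset and size entry per tile, all big-endian 32-bit integers) is a hypothetical stand-in chosen for illustration; the actual GridPix header layout is not given in this section, and `pack_tiles` and `get_tile` are assumed names.

```python
import io
import struct

COUNT = struct.Struct(">I")   # number of tiles (hypothetical header field)
ENTRY = struct.Struct(">II")  # (offset, size) per tile (hypothetical layout)


def pack_tiles(tiles):
    """Build file contents: tile count, index table, then the tile bytes."""
    offset = COUNT.size + ENTRY.size * len(tiles)
    index = []
    for tile in tiles:
        index.append(ENTRY.pack(offset, len(tile)))
        offset += len(tile)
    return COUNT.pack(len(tiles)) + b"".join(index) + b"".join(tiles)


def get_tile(f, number):
    """Return tile `number` (1-based, as in Figure 3.1) using two reads."""
    # Access 1: read this tile's (offset, size) entry from the header.
    f.seek(COUNT.size + ENTRY.size * (number - 1))
    offset, size = ENTRY.unpack(f.read(ENTRY.size))
    # Access 2: read the stored bytes as-is; nothing is decoded or re-encoded.
    f.seek(offset)
    return f.read(size)


image = io.BytesIO(pack_tiles([b"tile-one", b"tile-2", b"third tile"]))
print(get_tile(image, 2))  # b'tile-2'
```

Because the server only copies bytes that the converter produced ahead of time, the per-request cost is dominated by the two reads, which is consistent with the low CPU usage described above.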

3.1.3  Status 

The site opened to the public on March 2, 1998 with about 20,000 images [TAP+98]. We
currently have over 70,000 images available, occupying about 2.5TB of storage space when fully
mirrored. Each image is stored in several different formats, from the original PhotoCDs, which
are about 5MB per image, to human-processed TIFFs, which average about 12MB per image, to
GridPix, which are about 1.2MB per image.

The reason we couldn’t initially provide all the images as converted from PhotoCDs
is that the photographs are not of good enough quality to be presentable, and require manual
work to crop, reorient and color-correct them before they can be presented to visitors. According to
the people at the museum, the response from the public has been very favorable. Also, we have yet
to experience any downtime other than campus-wide network or power failures, despite individual
machines crashing or being rebooted many times.

Figure 3.1: Three-layer GridPix image (figure label: struct gridheader)

Figure 3.2: GridPix interface

I  have  also  gotten  a  few  inquiries  from  people  who  are  interested  in  using  GridPix  to 
present  their  own  images. 

3.2  Tertiary  Disk  Architecture 

The Tertiary Disk group has constructed a prototype disk storage system, on which I have
conducted this research. The prototype’s total capacity is 3.2 terabytes. The system occupies seven
racks, each 7 feet tall and 19 inches wide. Figure 3.3 shows the system configuration. Here are
some highlights:

•  20 PCs (200MHz Pentium Pro with 96MB of memory each) as disk servers

•  364 8.4-gigabyte IBM DCHS 7,200RPM Ultra-Wide SCSI disks with SCA (Single Connector
Attach) connectors

•  52 Trimm Technologies model 381 8-disk SCA enclosures with serial interface

•  44 Adaptec 3940UW twin-channel Ultra-Wide SCSI adapters

•  2 16-port (4 x 4) 100Mbps Fast Ethernet switches

•  4 24-port serial terminal servers for PC consoles and disk enclosure interfaces

•  PCs run the FreeBSD operating system, with minor modifications

•  Double-ending of SCSI buses for high availability

•  Redundant front-ends use HTTP redirect to route user requests

•  6 uninterruptible power supply (UPS) units

•  Remotely-controllable power switches on all PCs

•  2 PCs (133MHz Pentium) as HTTP frontends

•  2 PCs (200MHz Pentium Pro) as infrastructure servers

There are two different configurations with regard to PCs and disks. Four PCs are
configured in a “disk-heavy” configuration, called A-nodes, with 70 disks per PC. The remaining 16 are
in a “CPU-heavy” configuration, called B-nodes, with 32 disks per PC. SCSI buses on B-nodes
run at 20MHz (Ultra-SCSI speed), while those on A-nodes run at 10MHz (Fast-SCSI speed).
The reason for this restriction is that the longer cables and high capacitance from having the
maximum number of devices per bus on A-nodes made them unstable at 20MHz operation.

Serial ports act as system consoles for all but one of the computers. The serial ports make
the computers’ consoles accessible to automated maintenance scripts. One of the frontend
machines has a standard monitor and keyboard to provide access to the system when the operator is
in the machine room.


Figure 3.3: Tertiary Disk Final Prototype (A-nodes: 70 disks per pair, 2 pairs; B-nodes: 32 disks per pair, 8 pairs)


3.2.1  Commodity  Components 

The  most  important  feature  is  that  this  whole  system  is  built  only  from  commodity,  off- 
the-shelf,  components.  The  use  of  commodity  components  lowers  the  cost  of  storage  by  factors  of 
two  to  four  compared  to  standard  RAID  boxes.  At  the  time  of  construction  of  the  system  in  the 
summer  of  1997,  large  RAID  arrays  cost  about  60  cents  per  megabyte;  our  system  cost  about  20 
cents  per  megabyte  using  street  prices  of  components. 

3.2.2  Redundancy 

In  designing  the  TD  prototype,  we  have  taken  care  to  ensure  it  does  not  have  any  single 
point  of  failure. 

Power 

Multiple  UPS  units  provide  power  for  the  system.  There  are  power  rails  on  either  side  of  the 
racks,  connected  to  different  UPS  units.  UPS  units  also  provide  10  minutes  of  standby  power 
to  survive  temporary  glitches  in  the  main  power  line. 

Each  enclosure  has  two  power  supplies.  They  are  connected  to  power  rails  on  opposite  sides 
of  the  racks,  and  thus  to  different  UPS  units.  Each  power  supply  has  a  built-in  fan,  and  there 
is  a  third  fan  in  the  enclosure  to  ensure  that  the  airflow  around  the  disks  will  not  be  reduced 
to  half  when  one  power  supply  fails. 

Double-Ending  of  SCSI  Chains 

Double-ended PC pairs are connected to different power rails. They are also connected to
different network and serial switches. Termination on the SCSI bus is supplied by the SCSI
adapters. Figure 3.4 describes double-ending.


Figure 3.4: Double-Ending of SCSI disks

Our initial design used feed-through terminators to provide termination so that bus integrity is
not compromised even if one of the PCs loses power (Figure 3.5). However, the final prototype
was built without the external terminators. Terminators were omitted because of physical
constraints of the particular parts we purchased: the terminators could not be installed on some
of the connectors on the twin-channel SCSI adapters. Also, the SCSI adapters had special
jumpers that enable constant termination, i.e., the adapter will supply termination to
the bus even when the machine is turned off.


Figure 3.5: Double-Ending with feed-through terminators (components shown: PC, SCSI adapter, terminator, disks)


Having feed-through terminators affects the system in two ways. One is that it becomes possible
to remove a PC while the disks are still being accessed by the other PC of its double-ended pair.
However, this is not possible in our setup, since the SCSI components are not properly shielded
from power and signal glitches, and disconnecting a cable can cause incorrect functioning or
component damage. We plan to do any repair work only after powering off both machines as
well as all the disk enclosures involved, so having feed-through terminators will not help in
this way. We have sufficient redundancy so that powering down these two PCs will not deny
access to images.

The other effect is that the disks on the SCSI bus are still accessible when one of the SCSI
adapters has failed to the point that it is no longer able to supply proper termination. Similar
situations can arise if the SCSI adapter does not have a consistent termination setting, if a PC
power supply, power rail or UPS unit fails, or if the machine is turned off by the operator.
Given the failure frequencies we have observed, these four kinds of failures are very unlikely.
The last possibility, one of the two PCs being powered off, has happened only when there
was a part replacement, and so is already covered in the previous paragraph.

Data  Layout 

Most  data  are  mirrored.  Stable  data,  not  necessary  for  the  system’s  day-to-day  operation,  are 
backed  up  by  CD-ROMs. 

The rightmost column of Figure 3.6 shows how the data are mirrored. Each of the small boxes
indicates a disk enclosure holding 7 or 8 SCSI disks; for instance, the m0-m1 pair has 10 disk
enclosures and the m4-m5 pair has 4.

The m0-m1 pair and the m2-m3 pair, the A-nodes, back up each other, since they have more
disks than the other 16 disk servers with the CPU-heavy configuration. The remaining pairs,
the B-nodes m4-m5 through m18-m19, are also matched up with pairs with the same number
of disks.

3.2.3  Topology 

This  section  explains  the  topology  of  power  and  Ethernet  connections  in  our  system. 

Power  Switches 

Figure  3.6  also  shows  how  the  PCs  are  connected  to  the  remotely  controllable  power  switches. 




To ensure that a power switch failure will not take down both PCs of a double-ended pair,
the machines are hooked up in a crisscrossing pattern.

As a reference, Figure 3.7 shows how the PCs would be connected if we did not have
feed-through terminators or SCSI adapters with constant termination jumpers. Since a power
switch failure would take down all SCSI buses connected to the PCs on that switch, it is
pointless to connect the PCs that form a double-ended pair to different switches. In this case,
the only focus should be to make sure that the two double-ended pairs that hold mirrored
copies of the same data (e.g., m4-m5 and m12-m13) are not connected to the same power
switch.

Ethernet  Switches 

Figure 3.8 shows how the PCs are connected to our two 100Mbps Ethernet switches. Ours
are 4 x 4 switches, meaning the 16 ports are arranged in four groups of four ports each, with
the ports in each group sharing a common bus and a switch at the base to route between
groups. For instance, with the topology in Figure 3.8, m0, m4 and m6 can send or
receive a total of 100Mbps of data, while m4, m8, m12 and m16 can all transfer at 100Mbps
simultaneously.

The connections are designed to balance the load in case one of the PCs of a double-ended
pair fails, or one of the mirrored copies becomes unavailable.
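The power-switch rule described earlier in this section (no single switch should feed both PCs of a double-ended pair when constant termination is available) is easy to check mechanically. The following Python sketch does so for a made-up wiring table; the host-to-switch assignments are hypothetical examples and are not the actual wiring of Figure 3.6.

```python
def shared_switch_pairs(pairs, switch_of):
    """Return the double-ended pairs whose two PCs share a power switch."""
    return [(a, b) for a, b in pairs if switch_of[a] == switch_of[b]]


# Hypothetical assignments for illustration only.
pairs = [("m4", "m5"), ("m6", "m7")]
crisscrossed = {"m4": "power0", "m5": "power1", "m6": "power1", "m7": "power0"}
naive = {"m4": "power0", "m5": "power0", "m6": "power1", "m7": "power1"}

print(shared_switch_pairs(pairs, crisscrossed))  # []
print(shared_switch_pairs(pairs, naive))         # [('m4', 'm5'), ('m6', 'm7')]
```

An empty result means that a single switch failure leaves at least one PC of every pair running. The same kind of check could be applied to mirrored pairs for the configuration without constant termination.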

3.2.4  Disk  Interface 

There were several disk interfaces to choose from at the time we built the prototype. This
section briefly compares them and explains why we decided to use SCSI to build our system.

IDE 

The most basic and cheapest of the alternatives, IDE, is very popular as the system disk interface for
PC-based systems. However, it is not sufficient for server systems such as ours. For instance,
at the time we built the prototype, most IDE systems did not support bus-mastered DMA
(direct memory access), although today’s IDE systems support it. IDE allows only two disks
per bus, which limits the design space since the maximum possible number of disks per PC is
much lower than with any of the other alternatives. IDE also does not allow multiple initiators on a single
bus, making it infeasible to use with double-ending.


SCSI 

Small Computer System Interface (SCSI) is a parallel bus protocol that is widely used for
disk servers due to its availability on the market and its flexibility. SCSI disks usually cost
50-80% more than their IDE counterparts but are cheaper than FibreChannel disks. The
maximum number of devices (including the SCSI host adapter) is equal to the bus width, so
you can have 8 devices on an 8-bit (“narrow”) bus and 16 devices on a 16-bit (“wide”) bus.
The SCSI standard allows multiple initiators on a single bus.

SCSI has been doubling its transfer rate every three years, from 5MHz to 10MHz to 20MHz
to 40MHz to 80MHz in bus frequency and from 8 bits to 16 bits in bus width. The fastest SCSI
bus today runs at 160MB/s, with an 80MHz bus frequency on a 16-bit bus. However, the higher
clock rates have put stricter demands on cabling and on the electrical characteristics of components.


Figure 3.6: Power connection and mirroring topology on a system with constant termination (hosts shown: m0 through m19, tarkin, ackbar, stampede and leia; power switches power0 through power2)


Figure 3.7: Power connection and mirroring topology on a system without constant termination (hosts shown: m0 through m19, tarkin, ackbar, stampede and leia; power switches power0 through power2)


For example, the maximum cable length for 20MHz operation, the fastest possible for our
disks and adapters, on a traditional “single-ended” SCSI bus is 1.5 meters.

The current standard in the market is “differential” SCSI, which allows a much longer cable
by pairing signal and ground pins instead of sharing a common ground signal. The maximum
cable length is 12 meters for 20MHz operation using differential SCSI. For this reason, several external SCSI
RAID boxes used differential SCSI to connect the RAID chassis to the host
even before differential SCSI was in widespread use.

The new serial interfaces, described in the next two paragraphs, also mitigate the cable-length
problem.


SSA 

Serial Storage Architecture (SSA) is a point-to-point serial interface. SSA allows up to
20MB/s to be transferred in both directions on any given link, and supports up to 126
devices on an SSA segment. By connecting devices in a circle, it is possible to transfer up to
80MB/s total (40MB/s read, 40MB/s write) on one host adapter by moving data in both
directions simultaneously. SSA has the characteristic that a single link failure in a loop will not
compromise the integrity of the entire segment, as the loop will simply become a string when
the failed link is logically disconnected by the SSA circuits on the devices on both sides of
the failed link.

FC-AL 

Another standard, FibreChannel-Arbitrated Loop (FC-AL), is a loop-based serial interface.
FC-AL allows up to 127 devices on a single loop. There are several versions of the interface,
varying in cost and performance. The fastest implementation can transfer up to 1.06 Gbits per
second, or about 100MB/s. One of FC-AL’s weaknesses is that a single link failure will cause
the entire loop to cease functioning. This problem can be avoided by integrating two loops,
one in each direction, into one, or by using a repeater hub with automated failure isolation
features.

In 1998, the SSA and FC-AL groups agreed to merge their standards. The new
standard is mostly FC-AL based but also incorporates SSA’s link-to-link characteristics for safety.
The new SCSI-3 standard will run on several physical transport layers, including SSA/FC-AL.

At the time we were testing equipment for our prototype, IDE and SCSI were the only
interfaces for which devices were widely available and affordable. Between those two, SCSI was
the easy choice due to the technical limitations of IDE. Also, the SSA and FC-AL standards were still
highly fluid at that time.
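The parallel SCSI bandwidth figures quoted in this section follow from multiplying the bus frequency by the bus width. The short Python sketch below reproduces them; it takes the "bus frequency" values at face value as millions of transfers per second, which is how the text uses them, and the function name is an illustrative assumption.

```python
def scsi_bandwidth_mb_s(clock_mhz, width_bits):
    """Peak bus bandwidth in MB/s: one transfer of width_bits per cycle."""
    return clock_mhz * width_bits // 8


# 16-bit ("wide") bus at each bus frequency mentioned in the text.
for clock in (5, 10, 20, 40, 80):
    print(clock, "MHz ->", scsi_bandwidth_mb_s(clock, 16), "MB/s")
```

Under this model the A-node buses (10MHz, 16-bit) peak at 20MB/s and the B-node buses (20MHz, 16-bit) at 40MB/s, while the 80MHz wide bus reaches the 160MB/s quoted above. Not every frequency and width combination was necessarily sold as a product.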

3.3  Software  Architecture 

There are several aspects of the software architecture that need to be discussed. Here I will
explain how we handle user requests, and what modifications we made to the operating system to
support our system.




3.3.1  Handling  User  Requests 

Our main application, as described in Section 3.1, is a web server for fine art images.
Figure 3.9 shows how user requests are handled. When a user at the museum’s site selects an option
to display the GridPix version of an image, the request first arrives at a frontend machine. The
frontend then looks up a table and returns an HTTP redirect message to the user’s client, which
subsequently reconnects to the backend machine holding the image. From that point on, the client
interacts directly with the backend machine until the user has finished examining that image.


Figure 3.9: Handling user requests
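The frontend's table lookup and redirect can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the host names, the image identifier and the URL path are all hypothetical, and the real frontend scripts are not shown in this section.

```python
# Hypothetical table: image id -> backend hosts holding a copy.
LOCATIONS = {
    "img00042": ["m4.example.org", "m12.example.org"],
}


def redirect_for(image_id, locations=LOCATIONS):
    """Return (status, headers) for the frontend's reply to a request."""
    hosts = locations.get(image_id)
    if not hosts:
        return 404, {}
    # Redirect to the first machine on the list; the URL path is an assumption.
    url = f"http://{hosts[0]}/cgi-bin/mkhtml.cgi?image={image_id}"
    return 302, {"Location": url}


print(redirect_for("img00042"))
print(redirect_for("missing"))
```

After the 302 reply, the client talks to the backend directly, so the frontend does very little work per request.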


Masking  Failures 

There are several levels of masking we can do for failures. At one extreme, a system can
have non-volatile RAM holding state information for open TCP connections, so that a machine crashing
and rebooting will not cause any connection to be lost. The other extreme, as in NFS, is to have a
stateless server with clients retrying until the server replies. This model causes the client to lock
up until the server recovers, but will not lose any requests. It is even possible to replace a failed
server with another machine as long as the device IDs and i-node numbers of the filesystems stay
the same; from the client, it will just look like an extremely long network failure or server reboot.

Those two models are required in a local-area network environment where correctness of
response is most important. Our premise, as mentioned in Section 1.1.3, is that since the Internet is
so unreliable, it doesn’t make sense to try to mask all of the failures. Our goal then becomes how to
implement “repair by reload”; in other words, how to make sure the system will be able to mask
any failure within a few seconds so that retries from end users succeed.




Frontend  Switching 

Figure 3.10 shows how we mask frontend failures from users. We have two frontends,
tarkin.cs.berkeley.edu and ackbar.cs.berkeley.edu, backing up each other
using IP aliasing. The canonical address, gpx.cs.berkeley.edu, is usually an alias of tarkin.
The other machine, ackbar, checks every 5 seconds over the local area network to see if tarkin
is up. When it can’t contact tarkin, it takes over gpx. ackbar still keeps checking for
tarkin every five seconds, and releases the gpx alias as soon as it finds tarkin back up.

Figure 3.10: Frontend switching protocol (tarkin, primary, 128.32.45.132; ackbar, backup, 128.32.45.131)

This simple method handles all four of the cases where tarkin or ackbar is
down or its network interface is not functional. Table 3.1 summarizes them. Note that when
ackbar’s network interface is broken, ackbar will attempt to take over, but the clients, which
can only reach tarkin, will continue to connect to tarkin without any ill effects.


Cause                    Response            Clients connected to
tarkin down              ackbar takes over   ackbar
tarkin network failure   ackbar takes over   ackbar
ackbar down              (none)              tarkin
ackbar network failure   ackbar takes over   tarkin

Table 3.1: Frontend switching effects
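The backup frontend's behavior amounts to a two-state machine driven by the periodic reachability check. The Python sketch below captures only that decision logic; the class and method names are hypothetical, and the actual commands used to add or remove the IP alias are deliberately left out because they are not described here.

```python
class BackupFrontend:
    """Decision logic for the backup frontend; call check() once per probe."""

    def __init__(self):
        self.holds_alias = False

    def check(self, primary_reachable):
        """Return the action to take: 'take_over', 'release', or None."""
        if not primary_reachable and not self.holds_alias:
            self.holds_alias = True
            return "take_over"  # start answering for the canonical address
        if primary_reachable and self.holds_alias:
            self.holds_alias = False
            return "release"    # hand the canonical address back
        return None


backup = BackupFrontend()
print([backup.check(up) for up in (True, False, False, True)])
# [None, 'take_over', None, 'release']
```

In the "ackbar network failure" row of Table 3.1, the backup would also reach the take-over state, but since no client can reach it, the action has no visible effect.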


Backend  Failures  before  the  User  Connects 

Both tarkin and ackbar check all the GridPix servers every 5 seconds. This checking
is done by fetching a particular file from the machine; the file will not be available if the HTTP
server is down or the disk is having problems. When a frontend discovers problems on one of the
machines, it automatically forwards any subsequent requests from users to the backups. Since
the CPU load on the frontends is very low and the scripts are very simple, it is very unlikely that
these scripts will fail to detect problems quickly.
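A probe of this kind, together with the choice of the first healthy mirror, might look like the following Python sketch. The probe file name, the timeout and the function names are assumptions made for illustration; in the system described above the probing runs periodically in the background, whereas this sketch simply takes the probe as a pluggable function.

```python
import http.client
import urllib.request


def probe(host, path="/status.txt", timeout=2.0):
    """Return True if the probe file can be fetched from host over HTTP."""
    try:
        with urllib.request.urlopen(f"http://{host}{path}", timeout=timeout) as r:
            return r.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and HTTP error responses.
        return False


def first_healthy(mirrors, is_up=probe):
    """Return the first mirror that passes the health check, or None."""
    for host in mirrors:
        if is_up(host):
            return host
    return None


# Usage with a stand-in health table instead of real network probes.
health = {"m4": False, "m12": True}
print(first_healthy(["m4", "m12"], is_up=health.get))  # m12
```

Fetching a file through the normal HTTP path exercises both the web server and the disk, which is why a single probe covers both failure types mentioned above.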




Backend  Failures  while  the  User  Is  Connected 

The above two methods cover most cases except for one: users already in a GridPix
session when a server goes down. There are two ways to mask this case: implement something
similar to the frontends, with two servers backing up each other, or have another machine, possibly
one of the frontends, take over the IP address temporarily and forward the request to the backup.
The latter way is better for two reasons. First, the backends do not need to know which machines
they are backing up. When data is moved around, or when additional mirrors are added, the first
method would require updates of file lists on the backends in addition to the frontends. The second
reason is that it is easier to implement, since the frontend already has support for querying
the status of another machine and taking over, as well as for forwarding requests to a different machine.

Load  Balancing 

Currently we are just redirecting the user to the first machine on the list, since the load
is very low compared to the machines’ capacity. However, our frontend-backend protocol can also
be used to balance the load among backend machines if it becomes necessary. By having the backends
put some information in the file fetched by the frontends, they can easily communicate to the
frontends how much load they are experiencing.
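
One way such load reporting could work is sketched below. The probe file format (a line of the form `load=<number>`) and the function names are assumptions for illustration only; the text does not specify what information the backends would write into the file.

```python
from typing import Dict, Optional


def parse_load(probe_body: str) -> Optional[float]:
    """Extract the load figure from a probe file body, or None if it is absent or malformed.

    The 'load=<number>' line format is a hypothetical convention for this sketch.
    """
    for line in probe_body.splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip() == "load":
            try:
                return float(value)
            except ValueError:
                return None
    return None


def pick_backend(probe_bodies: Dict[str, Optional[str]]) -> Optional[str]:
    """Choose the reachable backend reporting the lowest load.

    `probe_bodies` maps each host to the fetched file body, or None if the fetch failed.
    Returns None when no backend reported a usable load.
    """
    candidates = []
    for host, body in probe_bodies.items():
        if body is None:
            continue  # fetch failed, so the machine is treated as down
        load = parse_load(body)
        if load is not None:
            candidates.append((load, host))
    if not candidates:
        return None
    return min(candidates)[1]  # ties are broken by host name
```

Because the load figure travels in the same file that the frontends already fetch for failure detection, no additional connection between frontends and backends is needed.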

3.3.2  Operating  System  Support 

In  this  section,  I  will  describe  what  modifications  we  had  to  make  to  the  operating  system 
in  order  to  build  our  system. 

SCSI/Disk  Subsystems 

The  SCSI  and  disk  subsystems  were  the  ones  that  caused  most  problems  for  us.  It  is 
understandable  because  commodity  hardware  is  not  usually  designed  with  a  large  server  system  like 
ours  in  mind.  Experience  with  scale  is  one  place  where  specialized  storage  system  manufacturers 
have  an  advantage. 

The  exact  nature  of  the  problems  and  our  fixes  to  them  are  not  part  of  the  main  focus  of 
my  research;  therefore,  they  are  not  covered  in  this  dissertation. 

Disk  Identification 

One problem with having several hundred disk drives, all looking identical, is that it is
very easy for the operator to confuse them. SCSI disks are identified by their SCSI IDs within
their bus; buses are distinguished by the host adapter number within the machine, and machines are
in turn identified by hostnames and IP addresses. Each host keeps a static configuration table,
typically called /etc/fstab, which describes which disk is mounted on which directory in the
filesystem hierarchy. In cases of concatenated or striped sets, there may be extra configuration
files describing the setups of those virtual disks. It is essential that the relationship between
the disk IDs and the actual disks be kept consistent.

The problem is that if two disks are exchanged, the operating system may not be able
to tell the difference; it will just use the disks until something fails. This problem is especially
acute for SCA disks, as SCSI IDs are specified on the disk enclosures, and if a pair of disks is
accidentally swapped, their IDs will change with the location in the enclosure. There could be
several different consequences ranging from OS crash to application error; some of them involve
corruption of filesystems and are extremely dangerous, especially when the disk is part of a striped
set.

We needed to design a system in which the operating system will be able to notice when
disks are installed incorrectly. There are various ways to implement this, with trade-offs in the
effects and complexity of design. Here are the alternatives we considered.

Serial  Numbers 

Each disk has a unique serial number written in its permanent memory at the factory. By
reading this number, the operating system can ensure that all the disks are in correct locations
by keeping a list of serial numbers of the disks and comparing disks to it upon each boot.

Comment: It is easy to implement, but will not let us do more than simple safeguarding
against operator errors.

Disk  Label 

Each disk has a “disk label” which describes the partitioning of the disk as well as some other
information about the disk, such as rotational latency and geometry [BSD94]. It is possible
to expand the format of the disk label to include some more information, such as expected
bus/SCSI IDs and mount points.

Comment: There are more features we can implement with this than with the previous option;
it can be used for anything from simple safeguarding with bus/SCSI IDs to more complex tasks
such as automatic mounting. Note that it does not even require a table to be kept on the system,
as the disks record on themselves where they are supposed to be mounted on the system. However,
since the disk label has a fixed size, there is still a limit on the complexity of what we can do.
For instance, we will need to add special fields to describe whether the disk is part of a striped
set, and if it is, how many other disks there are, where in the set this disk appears, and so on.
It is not very extensible, and in order to make new options on how to use the disk available, the
kernel or the utility that reads the disk label needs to be recompiled.

Script 

The last option is for each disk to have a “known” location (possibly a fixed partition)
holding a filesystem that contains a script; the scripts are executed in turn when the
system boots.

Comment: What we can do with this option is virtually unlimited. The scripts can be used
to check the disk’s identity: the system boot process just needs to call the disk’s script with
its bus/SCSI IDs as arguments. It can be used to auto-mount necessary filesystems, or it can
be used to construct more complicated entities such as striped arrays. Note that in either case,
there is no per-host configuration required on the boot disks themselves; the boot disks just
read in the scripts and execute them in turn. If the scripts are written cleverly, it may not even
matter in what order the disks are inserted in a particular enclosure.

If a disk is moved to a different machine, it is not possible for our system to rectify the
situation without the aid of an operator; we would need some kind of distributed filesystem to
handle such cases.




I have implemented the last option, the script method. It is now possible to move disks
around within the same machine and still have them mounted correctly, both for the single filesystem
case and the striped array case. The script is made to flag an error when the disk is inserted in an
inappropriate enclosure. The scripts also automatically select the correct PC of the double-ended
pair upon startup, and will refuse to be configured if the disk is connected to the wrong machine. The
checking is done by having the startup process pass an extra argument, the machine name, to the
disks’ scripts.
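
The decision a per-disk script has to make can be sketched as follows. This is an illustration in Python rather than the scripts actually stored on the disks; the record layout, the names, and the assumption that an enclosure corresponds to a SCSI bus are hypothetical. The selection of the correct PC of a double-ended pair is not modeled here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class DiskRecord:
    """What a disk 'knows' about itself (hypothetical layout for this sketch)."""
    host: str                      # machine the disk belongs to
    allowed_buses: FrozenSet[int]  # buses (assumed to map to enclosures) where it may sit
    mount_point: str               # where its filesystem should be mounted


def check_disk(record: DiskRecord, hostname: str, bus: int, scsi_id: int) -> Tuple[bool, str]:
    """Decide what to do given the arguments passed in by the startup process.

    Returns (ok, message). The disk may move freely among its allowed buses,
    but it refuses to be configured on the wrong machine or in the wrong enclosure.
    """
    if hostname != record.host:
        return False, f"refusing to configure: disk belongs to {record.host}, not {hostname}"
    if bus not in record.allowed_buses:
        return False, f"error: disk inserted in wrong enclosure (bus {bus})"
    return True, f"mount disk at bus {bus}, ID {scsi_id} on {record.mount_point}"
```

Since the mount point is recorded with the disk and the bus and SCSI ID are supplied at boot, the result does not depend on which slot of an allowed enclosure the disk occupies.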

Fast  Reboot 

It is important to reduce the time required for rebooting the system in order to minimize
the window of vulnerability. Table 3.2 summarizes the effects of various improvements I have made
or explored on our system. The numbers are based on a system with three SCSI adapters.


Reboot step     Improvement          Old time      New time      Reduction
Check disks     Use soft updates     20 minutes    28 seconds    20 minutes
System BIOS     BIOS-less boot       78 seconds    0 seconds     78 seconds
SCSI probe      Reduce SCSI delay    24 seconds    11 seconds    13 seconds
Device probe    Remove components    31 seconds    21 seconds    10 seconds
Total                                22 minutes    60 seconds    21 minutes

Table 3.2: Reboot steps and timings


These are more detailed explanations of each item, in decreasing order of significance.

Fast fsck The largest amount of time the machine takes after a crash is spent running fsck, the
UNIX file system consistency check and repair program [McK85]. It takes about 20 minutes
to run fsck on our 15-disk striped arrays. In the past, fsck checked all the filesystems
during every reboot, but in recent versions of UNIX, this has become a problem only after an
unclean shutdown: fsck will check the filesystem flags and will take no further actions if
the filesystem was unmounted cleanly, so unless the operating system crashed or the machine
hung and was power-cycled, fsck will complete in less than a second. Note that multiple
fsck processes can be run in parallel, so the number of such arrays does not change the total
amount of time it takes to check all filesystems.

Kirk McKusick, the author of fsck and architect of the original BSD Fast Filesystem [McK84],
has been working on a project called “soft updates”, in which changing the way dirty data
is written to the FreeBSD filesystem greatly improves both performance and reliability.
These filesystems also do not need fsck to be run even after a crash [McK98]. The filesystem
may lose some space, but the data structures on the filesystem will be consistent with
each other. We still need to run fsck from time to time to reclaim the space, but it can be
done at any time, not right after a crash.

I have been using soft updates on our machines for over a year with no ill effects. When I
started using soft updates, it was supplied as external “snapshots” from McKusick’s repository;
the code is now in FreeBSD’s main source tree.




BIOS-Less Boot I have also explored the possibility of implementing a “quick boot” option of the
operating system, which does not use a hardware reset. Our systems take 78 or 57 seconds,
depending on whether they have three SCSI adapters or two, just to go through the BIOS.
Most of the time is spent while SCSI cards are initializing themselves, but this is not really
necessary for a warm boot. Also, about 5 seconds is spent checking available memory, which
is also unnecessary if we are rebooting a previously running machine.

By having a reboot just jump to the initialization code in the kernel, we can completely
eliminate the lengthy BIOS process. However, various kernel modules are not necessarily written
to be reentrant, and although I did get some favorable responses when I mentioned this on the
FreeBSD developers’ mailing lists, I did not have time to actually implement this.

Reduce Time of Wait for SCSI Devices The delay for SCSI probe after a SCSI bus reset was 8
seconds per bus, but for our system, it could be reduced to 2 seconds without any noticeable
effect. The default delay was based on a pessimistic expectation that the user may have an
old device that takes a long time to reset itself after a SCSI bus reset is issued.

Remove Unneeded Kernel Components By removing unneeded kernel components, the kernel
becomes much smaller and uses less memory during normal operation. It is also faster to
load by about half a second and takes much less time to boot because it does not have to
probe devices that do not exist on our system. The removal of unneeded components has
cut about 10 seconds from the boot time, from about 31 seconds to 21 seconds. Of the
21 seconds, 11 seconds are used searching for the second IDE drive, a drive our machines
do not have but whose driver is kept in the kernel so we can make duplicates of system disks
quickly when there is an IDE disk drive failure. By removing this driver, the device probe time
can be further reduced to 10 seconds.
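
The totals in Table 3.2 can be checked with a few lines of arithmetic. The sketch below simply sums the per-step figures from the table; the last value, the total if the second-IDE-drive probe were also removed, is derived here from the 11-second figure in the text and is not a measurement reported in the table.

```python
# Per-step reboot times from Table 3.2, in seconds: (old time, new time).
STEPS = {
    "Check disks": (20 * 60, 28),
    "System BIOS": (78, 0),
    "SCSI probe": (24, 11),
    "Device probe": (31, 21),
}

old_total = sum(old for old, _ in STEPS.values())  # 1200 + 78 + 24 + 31
new_total = sum(new for _, new in STEPS.values())  # 28 + 0 + 11 + 21
reduction = old_total - new_total

# Derived, not measured: dropping the 11-second search for the second IDE drive.
IDE_PROBE_SECONDS = 11
total_without_ide_probe = new_total - IDE_PROBE_SECONDS

print(f"old total: {old_total // 60} min {old_total % 60} s")
print(f"new total: {new_total} s")
print(f"reduction: {reduction // 60} min {reduction % 60} s")
print(f"without second IDE probe: {total_without_ide_probe} s")
```

The sums come to 22 minutes 13 seconds before and 60 seconds after, which agrees with the rounded totals in the table. Note that the 0-second BIOS figure assumes the BIOS-less boot, which the text describes as explored but not implemented.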


3.4  Summary 

The Tertiary Disk group has built a 3.2TB storage system consisting of 364 SCSI hard
drives and 24 PCs, including frontends and infrastructure servers. We have also written an
application, a web-based image database server, and modified the operating system in order to run our
application more reliably and make it easier to maintain. The system has been online since March
of 1998 and has received favorable reactions from users. Other than campus-wide network or power
outages, Tertiary Disk has not had an outage that made images unavailable to outside users.




Chapter  4 

The  Cost  of  System  Administration 


As mentioned earlier, while storage system hardware has gotten cheaper over the years,
the cost of system administration has remained high. Since the decrease in cost of hardware is
not showing any signs of slowing down, this situation is not likely to change in the near future. If
system administration cost is many times that of hardware cost, clearly there is little or no point in
trying to reduce the cost of hardware. Thus, for our system to make sense, it is necessary that we
reduce the cost of system administration by at least an order of magnitude. There are several means
to accomplish this. We focus on two methods in this dissertation.

One is to make the system self-maintaining. By having the storage system maintain itself,
we will be able to reduce the burden on the system administrator, which in turn results in lower cost
and higher reliability. The issue of system administration is not only that of monetary cost. A system
that functions continuously without the aid of a human operator has a higher availability than one
that requires human intervention to keep it running. Also, the pressure on system administrators to
fix the problems immediately often results in human errors that compound the hardware problems.
With large storage systems, results of such operator errors can be disastrous. The self-maintaining
aspect of the system is discussed in Section 4.1.

The  other  is  to  make  the  system  easier  for  the  operator  to  maintain.  These  methods  are 
discussed  in  Section  4.2.  It  is  ideal  to  have  the  system  be  completely  care-free,  but  there  are  many 
aspects  of  storage  system  administration  that  cannot  be  made  fully  automatic.  However,  there  are 
other  ways  to  reduce  the  burden  of  the  system  administrator.  Although  these  improvements  are  hard 
to  quantify,  they  are  nonetheless  useful  in  increasing  the  availability  of  the  system,  thus  indirectly 
reducing  the  cost  of  administration.  For  instance,  a  system  which  safeguards  against  operator  errors 
will  have  a  better  uptime  than  a  system  that  requires  the  operator  to  make  no  errors.  An  example  of 
such  a  “safety  belt”  is  the  automatic  disk  identification  scheme  mentioned  in  Section  3.3.2.  Another 
way  to  make  the  system  easier  to  maintain  is  to  reduce  the  time  it  takes  to  upgrade  the  operating 
system,  a  topic  that  has  been  researched  in  the  past  in  the  context  of  binary  upgrades  to  commercial 
operating  systems  [SMH95].  If  the  system  administrator  has  to  work  late  into  the  night  to  upgrade 
a  cluster,  the  person  is  more  likely  to  make  mistakes  near  the  end. 

One thing my research does not address is management of data on a large storage system.
Management of data involves keeping track of where everything is located, reallocating space as
necessary, and so on. Although it is an interesting research field, there are other projects that are
pursuing it and I did not think I could make my approach generic enough to be of wide interest
[BGM+96, Wal98, Jir99]. For a summary of other research projects on this topic, please refer to
Section 1.3.4.


4.1  Self-Maintaining  System 

In  this  section,  I  will  define  what  “self-maintaining”  means,  and  outline  the  requirements 
on  how  to  construct  such  a  system. 

4.1.1  Definition  of  Self-Maintaining  System 

There are two aspects to self-maintainability of a system, depending on whether the
perspective is that of the system administrator or of the user.

System  Administrator’s  Perspective 

To the system administrator, a self-maintaining system is one that does not require
constant attention. For the purpose of my research, I define it as a system that is:

•  designed to function with only pre-scheduled, say, weekly maintenance, and

•  for which, at maintenance time, the required tasks are clearly defined.

There is nothing special about the repairs having to be weekly. The visits can be twice a week,
monthly, or even daily. The important part is that the visits are pre-scheduled and not driven by
events such as hardware failures. My research goal is to show that it is possible to build a system
that runs with reasonably spaced scheduled visits by humans.

There are two major benefits of this approach:

Reduce  Operator  Errors 

Regularly scheduled visits will help the operator’s performance and reduce mistakes. An
operator who is working during regular hours is less likely to make errors than one who was
paged at, say, 3 A.M. Also, not being under the pressure of having to fix the system right
away will likely reduce the chances of operator errors. Studies supporting these claims can
be found in Section 2.2.2.

Reduce  Operator  Cost 

As mentioned in Section 2.3, another benefit is that it is not necessary to pay a full-time wage
for a system administrator of such a system. We can either hire someone part-time or have
the same person administer many systems instead of just one or two.

The second rationale is similar to that behind Tandem’s TMDS [Bar81, Tan97b]. Tandem has
engineers at the support center 24 hours a day, and systems having problems will automatically
dial up Tandem to contact one of the support persons. A few hours after the failure, the support
person will show up at the customer’s site with necessary repair parts. Thus, the companies that
purchase support contracts do not have to hire operators by themselves; they are in effect “sharing”
the operators with other companies through Tandem.




Our system will take this one step further. There is no need for anyone to be anywhere
in the middle of the night; in fact, there is no need for anyone to be at any central support center. The
operators can travel from one customer site to another throughout the week, or they can be doing
something else during most of the week. By hiring a part-time operator who comes to work only
once per week (one day out of a five-day work week), an organization can cut the human cost of
administration to one fifth.

User’s  Perspective 

There are two classes of users on a web-based storage system like ours.

End  Users 

These  are  the  people  who  use  the  system  from  the  Internet.  They  know  nothing  about  the 
internals  of  the  system,  will  easily  get  annoyed  if  something  doesn’t  work,  and  may  never 
return  to  our  site  if  they  are  unhappy.  They  usually  only  issue  reads  to  the  system. 

Content  Providers 

There are relatively few people whose job is to update the contents of the web server. They are
part of the project, and can wait for a while if the system cannot allow writes at the moment.
They have a good knowledge of how to use the system, but usually know little about the
internal workings.

To reduce end-user frustration, it is important to reduce the system’s down-time as seen from across
the Internet. Note that a system that relies on an operator to keep it running is not as available as one
that maintains itself, as it will take minutes or maybe even hours for the operator to actually be able
to repair the damage [GS91]. Our goal is to have the system repair any interruption of service within
a few seconds, and continue to function unattended until the next scheduled visit by the operator.
For a web server application such as the one we are running, this is illustrated by the slogan “repair
by reload”.

As Mary Baker said in her Ph.D. thesis, “in the limit, as recovery time approaches zero, a
system with fast crash recovery is indistinguishable from a system that never crashes at all” [Bak93].
Her research was about file servers on distributed operating systems, and the above statement was
qualified as being appropriate only in systems that can tolerate short periods of down-time, such as a
cluster of workstations in a typical engineering or research environment.

I believe our application, a web server, is another example where a system with fast crash
recovery is just as good as a system that never crashes. This claim is based on the observation that
since the Internet is so unreliable, it will be impossible for the user to distinguish problems on our
servers from the daily transient problems of the network.
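
The effect of recovery time on availability can be illustrated with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is only an illustration: the formula is the textbook definition rather than something stated in this chapter, and the failure interval and repair times are hypothetical numbers chosen to contrast operator-driven repair (hours) with automatic repair (seconds).

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, not measurements from the Tertiary Disk system.
MTTF_HOURS = 1000.0
OPERATOR_REPAIR_HOURS = 2.0          # operator must be paged and travel to the site
AUTOMATIC_REPAIR_HOURS = 5 / 3600.0  # system reroutes requests within 5 seconds

with_operator = availability(MTTF_HOURS, OPERATOR_REPAIR_HOURS)
self_repairing = availability(MTTF_HOURS, AUTOMATIC_REPAIR_HOURS)

print(f"operator repair:  {with_operator:.6f}")
print(f"automatic repair: {self_repairing:.6f}")
```

As the repair time goes to zero the ratio goes to one regardless of how often failures occur, which is the limit described in the quotation above.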

For the content providers, the situation is a little different. It is permissible to have the
system be in a state in which it cannot receive input from them for a short while. However, repeated
problems will affect their productivity as well as delay the update of the contents, so it is desirable
to keep downtimes as short and as infrequent as possible.

In order to partially shield problems from the content providers, we offer a portal to
which they can upload the new images. Once the data is in the portal, they can go on with their work.
However, if there is a problem with some other part of the system that disallows writes, the new
images may not be available on the web until the problem is fixed.




4.1.2  Requirements 

There are several requirements for building a self-maintaining system. Here the
requirements will only be listed; the implementation details can be found in Sections 3.2 and 3.3.

No  Single  Point  of  Failure 

Such a system is not allowed to have any single point of failure. This restriction makes it
possible to decouple the time of repair from the time of failure, allowing the system to run
in the presence of a failure. Clearly, depending on how many failures the system should
tolerate, it may be necessary to have even more redundancy. Not having a single point of
failure is a minimum requirement for any system that is designed to function continuously in
the event of a component failure.

Constant  and  Reliable  Monitoring 

The system should be constantly and reliably monitored so corrective actions are taken very
quickly after failures. We use very simple shell scripts for monitoring, which sleep for
four seconds between checks. The monitoring scripts are autonomous, so a
failure on one machine will not cause a monitoring script on another machine to malfunction.

There are some characteristics of our application that make it easy to build a self-maintaining
system. Although these are not hard requirements, they nonetheless have helped us simplify the
design of the system.

End-users  Issuing  Only  Reads 

One important aspect of our system is the read-mostly nature of end-user accesses. We believe
this is not unique to our application; many other applications with similar terabyte-capacity
scale share the same characteristics. By not having to allow writes in degraded mode,
immediate recovery is only a matter of locating the backup and rerouting user requests there while
more lengthy recovery procedures can take place.

Little  Internal  Communication 

The application needs very little internal communication to handle user requests. See
Section 3.3.1 for details on how user requests are handled. This low usage simplifies the recovery
as there are only a few messages or connections that might be lost due to a failure. Also, the
low internal communication overhead enables the system to scale up nicely in the future.

4.2  A  System  That  Is  Easy  To  Maintain 

There are several other improvements we have made to our system to reduce the burden
on the system administrator. Although they do not relate directly to the “self-maintaining” aspect of
the system, they are nonetheless important for the maintainability of the system and are described
here as part of the effort to reduce the overall cost of maintenance.

As mentioned earlier, one of the most dramatic failures of our system was caused by the
operator replacing the wrong disk and attempting to reconstruct a striped filesystem, destroying the
only good copy of mirrored data. In order to avoid any more similar mistakes, I implemented
a disk identification mechanism that automatically detects when a disk is connected to the wrong
machine. It has also been useful when we had to replace disk enclosures: no matter what order
the disks are inserted into the enclosure, the striped set will always be configured correctly. If
the ordering of disks does not matter, it means one less thing to worry about when the operator
is performing a time-consuming and stressful operation like replacing a disk enclosure, thereby
reducing the chance of human errors in subsequent steps.

4.2.1  Disk  Identification 

As  mentioned  in  Section  3.3.2,  our  system  will  not  lose  data  if  a  disk  is  inserted  in  a  wrong 
location  in  the  cluster,  even  if  it  is  part  of  a  striped  array.  The  method  I  implemented  to  accomplish 
this  is  very  generic  and  extensible. 

4.2.2  Taking  Advantage  of  Redundancy 

There are several ways to take advantage of redundancy in ways not possible on a
traditional centralized server. I will describe a couple of those methods in this section.

Preventive  Maintenance 

People at Inktomi have found that their system can be made more reliable by restarting
the application periodically [BL98]. The reliability problems are caused by thread leaks
in the server process. Instead of spending many man-weeks trying to track down the last
thread leak bug, it is much easier to clear the problems by a restart. Since their servers are fully
redundant, it will not cause any interruption of service for the users.

Large servers are known to develop some bad processes from time to time. Anyone who
has been in a clustered computing environment has seen many messages from the system
administrator saying something like:

“The file server is going to have a quick reboot today at 12:15 to clear wedged NFS
processes; please send mail by 11:30 if you have a problem”.

One solution to this is to periodically reboot the server during off-peak hours, instead of
waiting for problems to develop. It will be even easier for both the users and the system administrator
if the machines can be rebooted without any interruption of service. System administrators will
especially benefit from it, since they will no longer have to worry about when to schedule the
reboot, check users’ mails to see if it is really ok to reboot it, and then wait at the console hoping the
machine will come up in three minutes as scheduled.

We have discussed implementing similar methods with our system. In our case, it is not
the software, but the hardware that benefits from reboots. The SCSI disks we are using sometimes
enter a certain state in which they no longer reply to SCSI commands; a reboot is required
to fix this. I suspect this is due to bugs in disk firmware that cause the state machine to go into a
situation not foreseen by the designer. I have observed that the problem occurs much more
often when there is a lot of traffic to several disks on the SCSI bus. Most of the time, the problems
are recoverable by sending SCSI bus resets from the disk driver (we have worked with FreeBSD
developers and had them add code to do just this), but sometimes the disks will not respond to bus
resets, and the machine either loses access to the disk or crashes.




Since  these  problems  are  usually  preceded  by  a  distinct  kernel  message,  we  can  prevent 
further  damage  by  automatically  rebooting  the  machine  at  first  sight  of  the  problem. 
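
A watcher of this kind could look like the following sketch. The text does not quote the actual kernel message, so the pattern below is purely hypothetical, as are the function names; the reboot action is passed in as a callable so that the detection logic can be tried out without restarting anything.

```python
import re
from typing import Callable, Iterable, Pattern

# Hypothetical pattern: the real warning message is not given in the text.
WARNING_PATTERN = re.compile(r"SCSI.*timed out", re.IGNORECASE)


def watch(lines: Iterable[str],
          reboot: Callable[[], None],
          pattern: Pattern[str] = WARNING_PATTERN) -> bool:
    """Scan kernel log lines and trigger `reboot` at the first matching message.

    Returns True if a reboot was triggered, False if the input ended without a match.
    """
    for line in lines:
        if pattern.search(line):
            reboot()
            return True
    return False
```

In use, `lines` would be fed from the kernel message log and `reboot` would invoke the system's restart procedure; because the other machines hold mirrored copies, the restart need not interrupt service to end users.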

Testing  New  Systems 

When new software, either operating system or application, arrives at a site, it is usually
installed on a few test machines that are set aside for this purpose. Only after the system administrator
is confident about the reliability of these systems is the software deployed in the main cluster.

There are several problems with this approach. First, the performance of the test system is
not necessarily a good indication of that of the real system. Even with artificially generated high
loads, the test is never going to be equivalent to the real workload. It is possible to trace the real
workload and simulate it on the test machine, but this is very expensive and will require an experienced
administrator to do it right. Second, the test-bed is not contributing to the overall performance of
the system, as it does not serve user requests. In essence, the price and administration cost of the
test-bed is an extra overhead in the cost/performance equation.

However, with a fully redundant and independent system like ours, it is possible to just go
ahead and install the new version of software on a few of the production machines. Since the system
as a whole is designed to tolerate one or more machines going down, even the result of a completely
dysfunctional software upgrade is not as disastrous as it used to be.
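
When choosing which production machines receive the new software first, one precaution is never to upgrade a machine and its mirror in the same round, so that a good copy of every image stays on a machine running the old version. The sketch below illustrates this under the assumption that each machine has exactly one mirror partner; the host names and the `mirror_of` table are hypothetical.

```python
from typing import Dict, List


def pick_canaries(mirror_of: Dict[str, str], count: int) -> List[str]:
    """Pick up to `count` machines to upgrade, never choosing both members of a mirror pair.

    `mirror_of` maps each host to the host holding the mirror of its data.
    """
    chosen: List[str] = []
    for host in sorted(mirror_of):
        if len(chosen) == count:
            break
        if mirror_of[host] not in chosen:
            chosen.append(host)
    return chosen
```

If the upgraded machines fail completely, requests for their data can still be served from the untouched mirrors, using the same failover path as for a hardware failure.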

Note  that  this  method  only  applies  to  subsystems  which  are  expected  to  either  work  or  fail, 
with  no  middle  ground.  In  particular,  if  there  are  updates  to  the  data  through  a  particular  subsystem, 
it will still be too dangerous to use a version of unknown quality, as this could result in corrupted
data. 

4.2.3  Modular  Upgrading 

Upgrading a cluster of computers is always difficult, and it is a problem that doesn’t have a simple
solution. People usually write custom scripts to do it semi-automatically [SMH95].

Problems  with  the  Build  Process 

FreeBSD is unusual in this respect: not only is the full source tree provided, it is
also carefully maintained so users can rebuild the entire system with one command (cd /usr/src;
make world). This command can be used to upgrade the system just by obtaining the latest
sources.  FreeBSD  also  offers  the  full  revision  history  of  the  source  tree,  which  enables  compiling 
the  entire  system  as  it  was  at  any  point  in  time  in  the  past,  making  it  possible  to  select  a  “known 
good”  source  tree  to  build  a  system  from  after  watching  the  mailing  lists  for  a  few  days. 

However,  this  approach  has  its  drawbacks.  If  you  want  to  compile  the  operating  system 
and  not  extract  it  from  a  binary  distribution,  a  fairly  complicated  process  is  required  to  make  sure 
all  the  tools  and  dependencies  are  built  in  the  right  order.  For  instance,  all  binaries  need  to  be  built 
with  new  libraries,  which  may  need  new  compilers,  which  may  need  new  text  formatters  to  process 
the  header  files,  and  so  on. 

This problem used to be solved by installing necessary components first on the host system,
then running a full compilation. Other than the danger of having a partially upgraded system
when the compilation fails in the middle, another drawback of this approach is that it puts a high
load on the network when a cluster of computers is sharing the same NFS filesystem for sources.

Separating  Build  and  Install  Phases 

I made it possible to segregate the build and install processes by changing the build process
to install necessary components in a temporary tree instead of the host system’s system directories.
The old “world” target is now composed of two targets, “buildworld” and “installworld”,
which don’t necessarily have to be run on the same system. “buildworld” will
rebuild the entire system without affecting the host system; “installworld” takes the result of
“buildworld” and installs it. Using this method, I have successfully upgraded our systems three
times. By running “buildworld” on a fileserver and then “installworld” on each of the 20
clients, the entire system can be upgraded in a few hours, with only a minimal amount of disk space
used.
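
The resulting workflow can be sketched as a small Python helper that only lists the commands to run rather than executing them. The host names, the use of `ssh`, and the assumption that every client sees the file server's source tree and build output at the same path (for example through an NFS mount) are illustrative and not taken from the cluster's actual scripts; `buildworld` and `installworld` are the make targets described above.

```python
from typing import List


def upgrade_plan(build_host: str, clients: List[str],
                 src_dir: str = "/usr/src") -> List[str]:
    """Return the shell commands for a build-once, install-everywhere upgrade.

    Assumes (hypothetically) that each client sees the same src_dir and
    build output as build_host, e.g. through an NFS mount.
    """
    plan = [f"ssh {build_host} 'cd {src_dir} && make buildworld'"]
    for client in clients:
        plan.append(f"ssh {client} 'cd {src_dir} && make installworld'")
    return plan


if __name__ == "__main__":
    # Hypothetical host names for a file server and 20 clients.
    hosts = [f"m{i}" for i in range(20)]
    for command in upgrade_plan("fileserver", hosts):
        print(command)
```

Keeping the plan as data makes it easy to review before anything is run, which matters when a failed step can leave a machine partially upgraded.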

Note that under the old method, I would either have had to run “world” on one of the
clients, causing the entire system to be compiled over NFS, or duplicate the source tree on all
the machines. Running a large compilation takes a long time; for instance, on our machines, the
“buildworld” takes about 1 1/2 hours if the source tree is on the local disk, and almost 4 hours
when it is on an NFS-mounted remote disk, for an improvement of 2 1/2 hours.

Also, the new method makes it easier to recover from faulty systems, since building
is done in a completely separate environment. If I installed a version of a system that is unreliable,
all that is required is a few tools (make, install, etc.) and functional NFS to go back to
any known good snapshot.

4.3  Summary 

There are two benefits in reducing the system administrator’s burden by making the system
maintain itself. One is that a system that automatically maintains itself will be more available
than a system that relies on humans to keep it running. The other is that a system that is easy to
maintain can be cheaper because a full-time system administrator’s salary is quite expensive.

This chapter analyzed these benefits and described some of the improvements I made
to our system to reduce the cost of system administration. These include taking advantage of
redundancy through preventive maintenance and field-testing of new systems, and modular upgrades
of a cluster of machines.




Chapter  5 

Failures 


We have been observing our system carefully since we constructed it. The objective is to
collect enough data to run a simulation to prove that the system will actually run continuously and
flawlessly in the presence of failures. The failure records cover the 24-month period from July
1997 to June 1999. In this chapter, I describe how the data was collected and present the results of
the analysis of the data. The simulation itself is described in the next chapter.

Nisha Talagala has presented another analysis of the same system [Tal99]. The data in
this section differs in that I have analyzed data for a longer period of time, intending to collect
as much data as possible for the simulation I was going to run, while she made a more in-depth
investigation of each of the failures, with extra emphasis on how components gradually deteriorate
over time until they fail. The data that is included in my summary and not in hers includes repair
actions, the amount of time it takes for repair, and clustering of failures. The results that are common
to both analyses are consistent with each other.


5.1  Collecting  Data 

We tried to keep a log of all the failures we experienced. Many errors have distinct
entries in system logs. I have modified the system to keep the logs for much longer than the default.
In addition to observing the components under normal use, we subjected some disks to artificial
loads to see if this would make any difference in failure rates. This experiment in part augments the low
loads we’re seeing on the Zoom project.

There are two logs in the system that can be used to observe component behavior. They
are the main system log and the HTTP server log. In addition, I have kept notes of most of the
repairs that I performed.

5.1.1  System  Log 

The main system log (called /var/log/messages in FreeBSD) is where all the kernel
messages, as well as messages from any process using the syslog facility, go. Disk problems
usually first show up here as a continuous stream of retries and failures. Processes getting killed due
to various reasons are recorded here too. Most of them are segmentation faults due to programming
bugs, while there are some others, such as pager faults, that are caused by hardware problems.




Jul 24 10:40:09 (da73:ahc4:0:9:0): READ(10). CDB: 28 0 0 50 54 cf 0 0 80 0
Jul 24 10:40:09 (da73:ahc4:0:9:0): RECOVERED ERROR info:505546 asc:18,2
Jul 24 10:40:09 (da73:ahc4:0:9:0): Recovered data - data auto-reallocated sks:80,12
Jul 24 11:02:11 (da40:ahc2:0:8:0): SCB 0x25 - timed out while idle, LASTPHASE == 0x1,
Jul 24 11:02:12 SCSISIGI == 0x0 SEQADDR == 0x8
Jul 24 11:02:12 (da40:ahc2:0:8:0): Queueing a BDR SCB
Jul 24 11:02:12 (da40:ahc2:0:8:0): Bus Device Reset Message Sent
Jul 24 11:02:12 Bus Device Reset Completed.
Jul 24 11:02:12 (da40:ahc2:0:8:0): no longer in timeout
Jul 24 11:02:12 ahc2: Bus Device Reset delivered. 2 SCBs aborted
Jul 24 11:02:22 (da40:ahc2:0:8:0): SCB 0x25 - timed out in command phase,
Jul 24 11:02:22 SCSISIGI == 0x4 SEQADDR == 0x146
Jul 24 11:02:22 (da40:ahc2:0:8:0): BDR message in message buffer
Jul 24 11:02:24 (da40:ahc2:0:8:0): SCB 0x8e - timed out in command phase,
Jul 24 11:02:24 SCSISIGI == 0x14 SEQADDR == 0x146
Jul 24 11:02:24 (da40:ahc2:0:8:0): no longer in timeout
Jul 24 11:02:24 ahc2: Issued Channel A Bus Reset. 4 SCBs aborted
Jul 24 22:41:29 (da78:ahc4:0:14:0): SCB 0x51 - timed out while idle, LASTPHASE == 0x1,
Jul 24 22:41:29 SCSISIGI == 0x0 SEQADDR == 0xb

Figure 5.1: Sample system log


Unfortunately, not all of the system logs are still available. Some of them have been lost
to disk failures, while some others have been deleted due to operator carelessness. At the beginning
of the project, we were not expecting the logs to be very precious. Table 5.1 shows how far back
the logs go on each of our machines. As can be seen in the last line, even though a fair number of
earlier logs are lost, there is still an average of over 17 out of 24 months of system logs remaining.


Host   Log From   Months     Host   Log From   Months
m0     Feb 1998   17         m10    Feb 1998   17
m1     Jul 1997   24         m11    Sep 1998   10
m2     Feb 1998   17         m12    Jul 1998   12
m3     Feb 1998   17         m13    Feb 1998   17
m4     Sep 1998   10         m14    Apr 1998   15
m5     Feb 1998   17         m15    Apr 1998   15
m6     Dec 1997   19         m16    Jul 1997   24
m7     Sep 1997   22         m17    Sep 1998   10
m8     Feb 1998   17         m18    Jul 1997   24
m9     Nov 1997   20         m19    Aug 1997   23
Total             347        Average           17.35

Table 5.1: Log durations
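
The Months column follows from the starting month of each log, counted inclusively through June 1999, the end of the observation period. The short Python check below recomputes the column from the table's own entries and arrives at the same total of 347 and average of 17.35; the function and variable names are illustrative.

```python
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

# Log start month per host, copied from Table 5.1.
LOG_FROM = {
    "m0": "Feb 1998", "m1": "Jul 1997", "m2": "Feb 1998", "m3": "Feb 1998",
    "m4": "Sep 1998", "m5": "Feb 1998", "m6": "Dec 1997", "m7": "Sep 1997",
    "m8": "Feb 1998", "m9": "Nov 1997", "m10": "Feb 1998", "m11": "Sep 1998",
    "m12": "Jul 1998", "m13": "Feb 1998", "m14": "Apr 1998", "m15": "Apr 1998",
    "m16": "Jul 1997", "m17": "Sep 1998", "m18": "Jul 1997", "m19": "Aug 1997",
}


def months_through(start: str, end: str = "Jun 1999") -> int:
    """Count months from start to end, including both endpoints."""
    s_mon, s_year = start.split()
    e_mon, e_year = end.split()
    return (int(e_year) - int(s_year)) * 12 + MONTHS[e_mon] - MONTHS[s_mon] + 1


durations = {host: months_through(start) for host, start in LOG_FROM.items()}
total = sum(durations.values())
average = total / len(durations)
print(total, average)
```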


As an example of what the logs look like, Figure 5.1 shows some lines from one of the logs for
the machine m0.cs.berkeley.edu. The year is 1998. To improve readability, I have removed
some fields to fit the lines here without excessive wrapping.

The following is what we can deduce from this particular fragment. First, at 10:40 AM,
the disk da73, with SCSI ID 9 on SCSI bus 4, had a single read error that was successfully recovered
by the disk drive’s controller. The disk’s controller automatically reallocated the data so the sector
in question will never be read again. About 20 minutes later, a different disk, da40, with SCSI
ID 8 on SCSI bus 2, had a few unexpected timeouts. These timeouts eventually led to the SCSI
driver issuing a SCSI bus reset in an attempt to restart the disk. The bus reset apparently cleared the problem,
since no more problems were reported until 10:41 PM of the same day. This is a typical
“timeout-recovery” sequence that is explained in detail in Section 5.3.3.

After removing non-essential messages such as login history, there were still over 250,000
lines in the system logs. I wrote a program to analyze the logs.
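
The analysis program itself is not reproduced here. As a rough indication of what such a program involves, the following Python sketch counts timeout and recovered-error messages per disk in lines of the form `(da73:ahc4:0:9:0): message`. The regular expression and the message categories are assumptions based on the sample log alone, not the rules used in the actual analysis.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches the "(da73:ahc4:0:9:0):" prefix that the SCSI driver puts on
# per-device messages, capturing the disk name and the message text.
DEVICE_MSG = re.compile(r"\((da\d+):ahc\d+:\d+:\d+:\d+\):\s*(.*)")


def classify(message: str) -> str:
    """Map a message to a coarse category (hypothetical categories)."""
    if "timed out" in message:
        return "timeout"
    if "RECOVERED ERROR" in message:
        return "recovered"
    return "other"


def summarize(lines: Iterable[str]) -> Dict[str, Counter]:
    """Return {disk: Counter({category: count})} for per-device messages."""
    summary: Dict[str, Counter] = {}
    for line in lines:
        match = DEVICE_MSG.search(line)
        if match is None:
            continue  # not a per-device SCSI message
        disk, message = match.groups()
        summary.setdefault(disk, Counter())[classify(message)] += 1
    return summary
```

Feeding the whole log through `summarize` gives, for each disk, the kind of per-category counts that the later sections report in aggregate.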

5.1.2  HTTP  Server  Log 

HTTP servers write their own logs. There is one file to record accesses (httpd-access.log)
and another for errors (httpd-error.log). Most of the entries in the error log are for
missing files, but there is some information such as when the HTTP servers were restarted. This
log would have been useful if the HTTP servers crashed more often. However, as it turned out, our
HTTP servers were very reliable and crashed only once, so I didn’t have to look into this file for
information on restarts.

5.1.3  Repair  Records 

Since I did not know I would be using the repair records for my dissertation during most
of the project, I did not keep a repair log per se. However, I believe I have been able to identify most
of the repairs I conducted. Other than my own memory, the information came from the following
sources.

Mail  Messages 

I have kept all the mail messages during the course of the project; there were over 4,000
messages originating from myself in the archive, and almost 10,000 total. Since I was the
only person conducting repairs on the cluster, I only had to check my own mail. When there
was a problem with the cluster, I often sent out mail to people in the Tertiary Disk project
detailing the problem and the fix/repair I conducted. Also, when there was an item such as a
disk enclosure that had to be sent back to the manufacturer for repair and I asked the secretary
to talk to the manufacturer, or when I had to coordinate with a contractor to have the faulty
Ethernet cables repaired, the exchange was in the mail archive.

Post-It  Notes 

When I replaced a bad SCSI disk, I wrote down the date and a brief summary of the problem
on a Post-It note and attached it to the component before storing it away. Since these disks
were never sent back for repair, I could read off the notes to get a summary of the replacements
I made.

Combining these with the log files generated automatically, it was possible to correlate
almost all of the failures and repairs with error messages.




5.2  Failure  Summary 

The  frequency  of  failed  and  replaced  components  we  have  seen  in  24  months  of  operation 
is shown in Table 5.2. MTBF1 shows the particular component’s individual MTBF (mean time
between failures) and MTBFn is the collective MTBF of all components of the same type. Values
of  MTBF  are  shown  in  years. 

TTR  (time  to  replace)  is  the  approximate  time  to  replace  a  component  in  minutes.  Most  of 
the  numbers  were  measured  by  taking  the  wall  clock  time  during  the  repairs  I  conducted.  Note  that 
this is just the time it takes to do the actual replacement of parts, intended to be used in calculation
of the amount of time the operator has to be physically present in the machine room. The MTTR
(mean time to repair) as often noted in literature is the time it takes the operator to get to the machine
room plus the TTR numbers here. Since our system is designed to function with regularly scheduled
visits by the operator, we can assume MTTR is half the visit interval for all components for which
we have spare parts; the TTR can be ignored in the MTTR calculation, as it is much smaller than
the  visit  interval.  For  those  without  spare  parts,  it  will  be  a  function  of  the  visit  interval  and  the  time 
it  takes  for  replacement  parts  to  be  shipped. 
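
As a worked example of this approximation, the Python sketch below compares the expected wait for the next scheduled visit with the replacement time itself. The weekly visit interval is an assumed value chosen only for illustration, and the rationale in the docstring (failures equally likely at any point between visits) is the usual justification for the half-interval figure rather than something stated in the text; the 60-minute TTR is the largest value in Table 5.2.

```python
def mttr_hours(visit_interval_hours: float, ttr_minutes: float = 0.0) -> float:
    """Approximate MTTR for a component with spares on hand.

    If a failure is equally likely at any point between two visits, the
    expected wait for the operator is half the visit interval; the
    replacement time (TTR) is added on top.
    """
    return visit_interval_hours / 2.0 + ttr_minutes / 60.0


weekly = 7 * 24  # hypothetical visit interval: once a week
with_ttr = mttr_hours(weekly, ttr_minutes=60)
without_ttr = mttr_hours(weekly)
# Fraction of the MTTR contributed by the replacement itself.
print(with_ttr, without_ttr, (with_ttr - without_ttr) / with_ttr)
```

Even with the slowest replacement, the TTR contributes only about one percent of the total under this assumption, which is why it can be dropped.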

Infant  mortalities  are  not  included  in  the  table.  Initially,  there  were  6  enclosures  with 
bad  power  supplies,  9  enclosures  with  bad  SCSI  buses  and  about  20  bad  Ethernet  cables.  Also, 
components  without  any  failures,  such  as  CPU  and  enclosure  fans,  are  not  listed. 


Component                Total   Failed   Ratio    MTBF1   MTBFn   TTR
IDE drive                   24        7   29.2%      6.9    0.29    60
SCSI adapter                44        1    2.3%     88.0    2.00    10
SCSI cable                  39        2    5.1%     39.0    1.00     2
SCSI drive                 364        8    2.2%     91.0    0.25    10
Disk enclosure (SCSI)       46        3    6.5%     30.1    0.67    60
Disk enclosure (power)      92        3    3.3%     61.3    0.67     5
Ethernet adapter            20        1    5.0%     40.0    2.00    10
Ethernet switch              2        1   50.0%      4.0    2.00    20
Ethernet cable              24        3   12.5%     16.0    0.67    10

Table 5.2: Frequencies of failures

Number of components and failures we observed are shown. MTBF1 is components’ individual
MTBF and MTBFn is the collective MTBF of all components of same type. TTR is
time it takes to replace failed component. MTBF is in years, TTR is in minutes.
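
The MTBF columns appear to follow directly from the counts and the 2-year observation period: the individual MTBF is the number of component-years observed divided by the number of failures, and the collective MTBF is the observation period divided by the number of failures. This reading is an inference from the numbers, not a formula stated in the text. The Python sketch below recomputes both columns under that assumption; it reproduces the tabulated values except for the disk enclosure (SCSI) row, where it yields 30.7 rather than 30.1.

```python
OBSERVATION_YEARS = 2.0  # 24 months of operation
HOURS_PER_YEAR = 24 * 365

# (component, total installed, failed) from Table 5.2.
COUNTS = [
    ("IDE drive", 24, 7),
    ("SCSI adapter", 44, 1),
    ("SCSI cable", 39, 2),
    ("SCSI drive", 364, 8),
    ("Disk enclosure (SCSI)", 46, 3),
    ("Disk enclosure (power)", 92, 3),
    ("Ethernet adapter", 20, 1),
    ("Ethernet switch", 2, 1),
    ("Ethernet cable", 24, 3),
]


def mtbf_individual(total: int, failed: int,
                    years: float = OBSERVATION_YEARS) -> float:
    """Component-years observed per failure."""
    return total * years / failed


def mtbf_collective(failed: int, years: float = OBSERVATION_YEARS) -> float:
    """Years between failures anywhere in the population."""
    return years / failed


for name, total, failed in COUNTS:
    print(f"{name:24s} {mtbf_individual(total, failed):6.1f} "
          f"{mtbf_collective(failed):5.2f}")
```

The same function gives the hours figure quoted for SCSI drives: `mtbf_individual(364, 8) * HOURS_PER_YEAR` is a little under 800,000.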


Here  are  some  observations: 

•  The Ethernet switch has too small a sample size to draw any meaningful conclusion. Nonetheless,
it still underlines the importance of avoiding a single point of failure: a lot of
our data would have been unreachable if we hadn’t designed the system so that there are
at least two distinct paths from any data to the outside world.

•  On the other hand, the disk enclosures’ SCSI bus integrity and IDE hard drives are major
causes of concern. As can be seen from the TTR numbers, these are also the two hardest
components to replace. We have investigated methods to boot the machines with CD-ROMs
as system disks to avoid the IDE hard disk problem altogether.

Another experience worth noting is that it is very hard to diagnose SCSI bus integrity problems.
The problem can be in any of the SCSI host bus adapter, cable, disks, enclosures or
the terminator. In addition, simply replacing parts one by one is often not enough to find the
real cause. For instance, sometimes a faulty disk enclosure backplane will temporarily start
working again when a cable is replaced, giving the false impression that the cable is at fault.
This behavior is due to the nature of these problems, as they are usually caused by slightly
loose connections somewhere in the bus and can come and go with the slightest change. As
can be seen in Table 5.2, many of our problems were caused by the enclosures; we did not
experience any SCSI disk failure that resulted in the whole SCSI bus being unreliable.

•  Compared to the IDE drives, SCSI drives have been much more reliable, as 91 years MTBF
translates to about 800,000 hours. We suspect the difference is due to two factors: IDE drives
being  of  lower  quality  in  general,  and  the  superior  cooling  of  external  disk  enclosures  that 
house  the  SCSI  drives. 

Also, we have been artificially loading the system to see if it will cause more component
failures. The load didn’t seem to have any effect on component failure rates.

5.3  Hardware  Error  Details 

More details on the errors in Table 5.2 are given in this section. The actual methods of repairing
various components are also explained.

5.3.1  PC 

Except for the internal IDE hard drives where the operating system resides, the computers
themselves have been very reliable, with no errors on motherboards, CPU, memory, and other
components.

IDE  Hard  Drive 

The internal IDE hard drives turned out to be the only unreliable part of the PCs. There
have been 7 disks that had to be replaced. The failures happened in August 1997; May, July,
September, and October (two) 1998; and June 1999.

It is not clear whether there is any correlation between these failures. There appears to be
a certain amount of clustering in July-October 1998, as there were 4 failures within a 4-month span
while there are only 3 more in the remaining 20 months.

Causes of Failure  It is well known that heat and vibration are two external factors that often
contribute to disk failures. Vibration doesn’t seem to be a problem on our systems as the PC cases
and external disk enclosures are both mounted on a well-built rack, but the PC cases do not have
adequate ventilation in some areas, particularly around where internal disks are mounted. Therefore
I believe heat is one of the reasons why so many IDE hard drives have failed. This theory is also
supported by the fact that in other machines that are not part of this cluster, I have seen several
SCSI disks mounted inside the same PC cases fail, while identical disks in external enclosures have
not exhibited any failures.

I checked the log of the machine room’s air conditioner, but there wasn’t anything that
indicated an extended downtime anytime in 1998. There are two air conditioners in the room, so
even if one of them went down, the room temperature would still be controlled.

A possible explanation of the clustering of the failures is that all IDE hard drives, which
were shipped with the PCs to our site together, were manufactured at around the same time
and have “aged” together, causing the failures to cluster somewhat.

The  sample  size  is  too  small  to  draw  any  conclusions,  but  it  still  might  be  a  good  idea  to 
try  purchasing  disks  from  several  sources,  possibly  from  different  manufacturers,  when  designing 
a  system  such  as  this  one,  in  order  to  reduce  the  risk  of  having  several  machines  fail  during  a  short 
period  of  time. 

In  addition  to  the  above,  there  have  been  two  more  disks  that  had  several  errors  that  didn’t 
result  in  an  immediate  death.  One  of  them  had  14  failures  in  August  1998  (this  disk  was  eventually 
replaced  in  October  after  suffering  6  more  failures),  and  another  had  5  failures  in  September  1998 
and  2  more  in  November.  The  second  disk  is  still  functional  to  this  day,  with  no  more  errors  logged 
since  then. 

Replacing  an  IDE  Drive  It  takes  about  one  hour  for  an  experienced  operator  to  replace  a  failed 
IDE  hard  drive.  About  half  of  that  is  spent  in  the  physical  act  of  replacing  the  disk,  and  the  other 
half  in  reinstalling  the  operating  system.  Here  are  the  steps  involved  in  replacing  the  disk. 

1.  Shut  down  the  machine. 

2.  Disconnect  all  cables.  The  SCSI  cables  have  two  screws  each  that  have  to  be  loosened  by  hand. 
Sometimes  the  receptacles  of  these  screws,  which  happen  to  be  screws  themselves,  come  off  the  SCSI 
adapters  with  the  cables. 

3.  Remove the machine from the rack. The machine fits snugly in the shelves, as two PC cases side by side are
exactly the width of the 19-inch rack. However, this also means pulling out the machines requires a
fair amount of force.

4.  Place  the  machine  on  a  flat  surface  and  open  up  two  sides  of  plastic  panels  that  can  be  easily  snapped 
off. 

5.  Turn  the  machine  upside  down  and  remove  plastic  panels  from  the  other  two  sides.  This  involves  lots 
of  practice  and  judicious  use  of  two  screwdrivers. 

6.  Remove  four  screws  that  are  holding  the  disk  drive  in  place. 

7.  Remove  the  IDE  and  power  cables  from  the  drive. 

8.  Replace  the  disk  with  a  new  one. 

9.  Reconnect  the  cables. 

10.  Reattach the four screws. Two of them are only accessible through holes roughly 2cm by 2cm in the
metal case and have to be inserted into screw holes about 5cm away from the metal surface.

11.  Put the plastic panels back. Be careful not to break them, as putting them back is almost as hard as
taking  them  off  and  protruding  parts  have  to  be  inserted  in  just  the  right  order  while  sliding  the  panels 
around  to  hold  them  together. 




12.  Push  the  machine  back  into  the  rack. 

13.  Reconnect  all  cables. 

14.  Power  up  the  machine. 

From steps 5 through 11, it is obvious that these particular PC cases were not designed
with installation and deinstallation of internal disks in mind. If there is any lesson learned from this
experience, it is that in order to ease the system administrator’s burden, it is important to actually
open up the machines and try replacing some parts before purchasing them in quantities. Rack-mount
PC cases are generally easier to handle in this respect.

The operating system installation is done by first making an identical copy of the system
disk from another machine, and then editing a file to give it the right hostname. The copying
takes about half an hour. By keeping spare system disks ready to be swapped in, this wait can be
eliminated.

5.3.2  Network 

The university has experienced a few network problems that disconnected the entire
Berkeley campus from the outside world. The most recent of those occurred on July 1, 1999, when
a broken water main caused the machine room holding the network hub to be flooded for a few
hours. We have not had any problems between our cluster and the campus backbone. The following
are the problems we had inside our system.

Ethernet  Cables 

The Ethernet cables for the cluster were cut and installed on site by a contracted technician.
Apparently this person had a bad tool and many of the connectors were not clamped properly.
Out of the 42 cables installed, including spares, about 20 were not working from the beginning.

After those were redone, the cables have been fairly reliable, with only three more failures
the rest of the way. However, two of those happened when the machine room was reconfigured and
the cables had to be unplugged temporarily. The implication of these recent problems is that the
“working” cables might also be suffering from bad craftsmanship. From experience, I know that the
three out of 24 figure is still too high for Ethernet cables.

Assuming there are enough spare cables installed on the racks, replacing a failed cable
can be done just by unplugging and plugging both ends of the cable. In the long run, the unused
bad cables will also need to be replaced. With an ample supply of cables that the operator can use as
replacements, it will not take long to replace them.

Network  Switch 

One of our two 16-port network switches suffered a complete failure in November 1997.
All hosts connected to that switch were unreachable until I noticed the problem and moved the
cables over to an old network hub which was acting as a spare at the time.

The replacement switch arrived five days after the failure. The physical replacement of
the switch was easy, as it involved removing just two screws in the front and pulling out the switch.
The switch also had to be initialized by special software from a portable computer connected to
the serial port of the switch chassis. It took about 20 minutes in all.




Network  Interface  Card 

The Ethernet card on one of the machines suddenly stopped working in August 1997 and
had to be replaced. The replacement took about five minutes. The steps are much simpler than those
for IDE hard drives described in Section 5.3.1, as only the two easily removable panels have to be
snapped off before the card can be removed by loosening two screws.

5.3.3  The  SCSI  Subsystem 

The SCSI subsystem consists of various components. I have categorized them into three
parts: the disks themselves, components that support the integrity of the SCSI bus, and the host bus
adapters.

SCSI  Disk 

Our SCSI disks have two vastly different characteristics: reliable hardware and flaky
firmware. This section analyzes those in detail.

Removing a SCSI disk involves shutting down the machine, powering off the enclosure
and sliding the disk out. Installation of a new disk is just the opposite. New disks have to be mounted
in canisters with four screws. It takes about 10 minutes in all.

On a partially related note, in April of 1999, one of the new departmental file servers had
problems triggered by a disk failure. Several drives failed in a chain reaction and some filesystems
were lost even though the system had RAID-5 protection. The system administrators have provided
a detailed write-up on the problem for circulation in the department. According to the report, the
vendor contended that the problem was caused by disk firmware revision mismatches. However, I
suspect there were bugs in the RAID driver or operating system that were triggered by the firmware
bugs and that compounded the problem to the point that the server had to be taken off-line for a few
days and lots of data was lost.

Hardware  There have been relatively few SCSI disk failures. Only 8 of the 364 disks in the
system had to be replaced. These failures happened in August, September, and December of 1997,
two in February of 1998, and one each in May, August, and October of 1998. From this data, there
does not appear to be any clustering, although we haven’t seen any failures for 8 months. Load
doesn’t seem to have any effect on SCSI disk failures, as only 1 of the 8 failed disks was on the 8
(out of 24) machines with artificial loads.

Figure 5.2 summarizes the disk errors I found in the logs. Three of the disk replacements
predate the logs so I couldn’t determine whether there were any error messages before the failures
for them. Including the total failures, there have been 13 disks with read errors and 10 with write
errors. None of them had more than one sector that was unreadable or unwritable. All of those were
still unreadable after retries, resulting in the kernel returning an error to the application requesting the
data. Two disks had both read and write errors, and 9 more could be attributed to bad enclosures.
Subtracting those and the ones that were replaced, there are 8 disks that had an unrecoverable error
that can’t be attributed to anything else, but are still working fine to this day.

There have also been 1,958 recovered read errors on 26 disks and 977 recovered write
errors on 2 disks. Most of these were on disks that eventually failed.


Columns: Disk ID (machine:disk); Unrecovered errors (Read, Write); Recovered errors (Read, Write);
Replaced? (Disk, Enclosure). Each row lists the disk ID followed by its non-empty cells in column order.

m0:5     431  Y
m0:73    1
m0:76    1
m1:41    1
m1:44    5
m2:4     8  Y
m2:8     1  Y
m2:9     10  Y
m2:10    16  Y
m2:15    1  1  Y
m2:53    1  Y
m2:56    1  Y
m3:13    4
m4:0     1
m4:31    1
m5:0     1313  Y  Y
m5:1     2  Y
m5:4     1  Y
m5:8     1  Y
m5:9     1  Y
m5:32    1  Y
m5:33    1  Y
m5:45    1  Y
m6:1     1
m7:9     1  1  140  14  Y
m9:40    2
m10:41   1
m10:5    2
m11:2    1
m11:9    1  7  963  Y
m14:1    1  3  Y
m14:4    1
m14:42   1
m14:45   1
m15:0    1
m15:1    2
m16:0    1
m16:2    1
m16:43   1
m17:33   1
m17:41   1
m18:5    3

Figure 5.2: SCSI disk read/write errors




Firmware  There appears to be a bug in the firmware of our SCSI disks [GM99]. In short, when
there is heavy traffic on the SCSI bus, a disk can lose track of its state and stop responding to
commands. There have to be several disks, at least four or five, on the same SCSI bus, all accessed
heavily at the same time, for this to happen. The only way to get the disk to start responding again
is to reinitialize its finite state machine with a SCSI bus reset or a power cycle. We had the
SCSI driver modified by the FreeBSD SCSI team to try to detect a hanging disk and restart it with a
bus reset, which fixed most of the potential problems, but there are still some instances
that couldn’t be corrected automatically. As shown in Figure 5.1, these appear in the log as SCSI
timeouts.

Almost two thirds of the 364 SCSI disks (235) in the system have suffered from this temporary
amnesia. There are 6,540 timeout messages recorded in the logs. Some of these are benign, but
others are more serious. The driver gave up in 23 of these cases, causing the kernel to stop sending
commands to those disks. The only way to access such a disk again was to reboot the machine.
Many others caused accesses to the disks to hang indefinitely, sometimes hanging or crashing the
operating system in the process.

Since this particular problem is quite specific to our system, and can be avoided by thoroughly
testing the disks before selecting a particular brand for purchase or by upgrading the firmware
on disks as necessary, I have chosen to ignore these timeouts for the purpose of my simulation.

By the way, in order to upgrade the disk firmware, the machines have to be booted into
MS-DOS to run a program supplied by the vendor. Since most of our machines don’t have keyboards
or monitors connected, this involves moving many disks around, and with the SCSI enclosure
problems we were having, we were afraid this could cause more problems than it would solve. This
awkwardness is the reason why we have not upgraded the firmware on our disks despite the occasional
crashes the bug causes.

SCSI  Bus 

There are three components that can cause SCSI bus integrity problems: cables, enclosures
and termination. Termination, either in the form of external terminators or internal to the SCSI
host bus adapters, has never failed on our system.

SCSI  Enclosures 

The SCSI enclosures have been the biggest disappointment of our system. As shipped, nine
of the enclosures had SCSI bus integrity problems. During the operation of the system, three
more developed similar problems, one in May 1998 and two in November, and had to be sent
back to the manufacturer for repairs. There is at least one more enclosure in the cluster that
sometimes develops problems when a disk is unplugged and plugged back in.

Replacing a SCSI enclosure takes at least two people to lift it. It is ideal to have three, as
the enclosures are a little wider than the racks and thus the racks have to be pushed aside with
screwdrivers while the enclosures are inserted. In addition, the disks have to be moved from the
old enclosure to the replacement, and two SCSI cables and four power cables have to be reinstalled.

Three of the disk enclosure power supplies died during operation. Six more were bad as
shipped, or died shortly after installation. Since our enclosures have hot-swappable dual
power supplies, none of these affected operation of the system. Replacing a power supply
is very simple, involving just one screw and reinstalling the power cable.




SCSI  Cables 

The external SCSI cables have been fairly reliable, with only two having to be replaced in
two years. They can be replaced very easily.

SCSI  Host  Bus  Adapters 

Only one of the SCSI host bus adapters failed on our system. One of the chips on the
adapter literally burned out. SCSI adapter cards are as easy as the network cards to replace.

5.4  Software  Errors 

Compared  to  hardware,  the  software  on  our  systems  has  been  remarkably  reliable. 

5.4.1  Operating  System 

As far as I can tell, there have been no crashes due to a bug in the operating system during
the period. Even though I don’t know the exact cause of all the crashes and hangs, as many of those
didn’t leave a log or crash dump that I can analyze, I believe all the OS crashes can be attributed to
disk firmware problems for the following reasons.

One of the machines, stampede, is an NFS server for the entire Tertiary Disk cluster as
well as some other machines in the department. In addition to being an NFS server, stampede
runs an HTTP proxy, a small anonymous FTP server and a mailing list server. Stampede has been
used much more heavily than the other machines in the cluster, and was crashing almost daily when
it had the same disks as the others. Stampede has not had a single crash in the two years since I
replaced the disks with those from another manufacturer.

Also, I have been watching the console messages of some machines on their serial ports
as they either hung or crashed. In every one of the dozen or so cases I have observed, there was a
message indicating a disk firmware problem just before the machine hung or crashed.

5.4.2  Application 

The HTTP server has died 11 times during the 30-month period. 10 of those were bus
errors and 1 was an abort signal. All of the bus errors happened when the operating system was
upgraded while the HTTP servers were running. Such crashes usually result when a shared library
is changed while an application is running: the in-memory version of the shared library can be out
of sync with the version on disk. Since the machines are subsequently rebooted, these errors were
not repeated.

As the frontends were instructed not to forward requests to machines that were being
upgraded, none of these were visible to the users. This leaves only one crash, the one caused by the
abort signal, as a legitimate crash of the HTTP server caused by a software bug.

This result is in stark contrast to Jim Gray’s 1985 study on failures, in which he observed
that 25% of system failures are caused by software [Gra85]. This difference is probably due to both
the operating system and application on our site being relatively stable versions. If we counted the
number of application crashes during the period when it was still under heavy development, the
system would have had a much higher software failure rate.




5.5  Summary 

We have been observing our system since we constructed it. Records of replacements,
together with the system logs of the operating system, were analyzed to determine the failure rate
of each component. Although we have lost some of the logs, I believe we were able to determine the
failure rates with good accuracy.

The SCSI disks’ failure rates were consistent with what was advertised by the manufacturer;
those of IDE disks and external disk enclosures were not. There didn’t appear to be any
correlation between separate failures. The operating system and applications were surprisingly
reliable.

The  data  collected  in  this  chapter  is  used  in  the  next  chapter  as  default  parameters  for  the 
simulator  to  determine  the  availability  of  our  system  and  variations  of  it  over  a  long  period  of  time. 




Chapter  6 

Validation 


I ran experiments to validate the feasibility of our approach. The objective was to prove
that the system will actually run continuously and flawlessly in the presence of a variety of failures.
A simulator, in conjunction with an event generator, was used to calculate the expected downtime
and time between failures as a function of several parameters such as component failure rates and
repair interval. By running a simulation over a long period of time, I was able to obtain a good
estimate of availability and time between failures.

The simulator models the system as a directed acyclic graph (DAG) of nodes and arcs.
A node represents either a hardware component, such as a disk, or an abstract concept, such as
“data” or “users”. The system is functional if all the “data” are connected to the “users”. Using this
method, I reduced the availability question to a simple graph connectivity problem on the DAG. I
believe this novel approach is an advance over a more detailed model of the system.

Figure 6.1 shows a simple example. Data is mirrored on two disks on two different machines.
Users are on the same local Ethernet segment as the servers. Power supply and cable failures
are ignored. There are two paths from Data to Users. If either one of Route A or B is complete, the
system is functional. However, if both Route A and B are disconnected (for instance, if PC A and
Disk B are broken at the same time), then the system has suffered an outage.
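
The example of Figure 6.1 can be sketched in a few lines of Python. This is only an illustration of the idea, not the simulator's code: the component names, the `ROUTES` table and the `functional` helper are assumptions made for the example, including the assumption that both routes pass through the same Ethernet switch.

```python
# Components that data must pass through on each route from Data to Users.
# Names are illustrative; power supplies and cables are ignored, as in the text.
ROUTES = {
    "A": ["Disk A", "PC A", "Ethernet Switch"],
    "B": ["Disk B", "PC B", "Ethernet Switch"],
}

def functional(broken):
    """Return True if at least one route contains no broken component."""
    return any(all(part not in broken for part in route)
               for route in ROUTES.values())

print(functional({"PC A"}))            # Route B is still complete
print(functional({"PC A", "Disk B"}))  # both routes are cut: an outage
```

Enumerating routes by hand only works for a toy graph; the simulator instead evaluates connectivity on the DAG itself.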

There are two types of nodes: Data nodes allow data to flow through them; power nodes
are power supplies that don’t carry data but instead control functionality of other nodes. Nodes
associated with hardware components will sometimes break and affect the availability of the system.
Table 6.1 classifies various types of simulator nodes. Various abstract concepts and their
implementation in the simulator are explained in Section 6.1.2.


Type    Represents   Examples
Data    Hardware     SCSI disk, SCSI adapter, SCSI cable,
                     Ethernet cable, Ethernet switch
        Abstract     data, users, striped set, dataset
Power   Hardware     UPS unit

Table 6.1: Simulator nodes


Figure 6.1: A simple connectivity graph

In my simulator, an arc generally denotes a connection where data can flow from one
node to another. Some arcs convey other special information, such as “electric power”. Arcs never
break in my simulator. Note that “cables” in our system are represented as nodes, not arcs; for
instance, Ethernet cables connecting Ethernet adapter cards on individual PCs to the central switch
are implemented as in Figure 6.2. By using nodes to represent cables, it became possible to simulate
cables with multiple endpoints, such as double-ended SCSI buses.

The failure data mentioned in Chapter 5 was used to design the input to the simulator.
In addition to running the simulator on a configuration similar to ours, I have also tried different
configurations, such as with or without double-ending, different numbers of disks per PC, and so
on. Some simulations were done to answer “questions” in the form of English sentences, for
instance, “what will it take to build a system with 99.999% availability?” They are all described
later in this chapter.

6.1  Simulation  Setup 

I  wrote  an  event-driven  simulator  to  use  the  data  we  collected  to  experiment  with  various 
design  parameters.  The  parameters  include:  different  repair  intervals,  systems  with  and  without  disk 
striping,  systems  with  and  without  fast  reboot  optimizations,  different  disk  failure  rates,  and  so  on. 
The  simulator  also  allowed  us  to  investigate  radically  different  designs,  such  as  having  significantly 
more  disks  per  machine  or  not  having  double-ending. 

Figure 6.2: Ethernet cables

The simulation setup consists of two parts: the event generator and the simulator. The
generator generates random events based on parameters given to it. The simulator reads the output
of the generator and emulates the system’s behavior. The simulator is designed to analyze the flow
of data in the system as a graph connectivity problem and detect possible outages.

6.1.1  The  Generator 

The design goal of the generator was to be able to generate events in a realistic manner
according to the reliability observations we have made on our system over the course of the project.
The events follow the exponential distribution and are not correlated with each other.

Generator  Parameters 

Table 6.2 shows general command-line options to the generator. The random seed is
used as an argument to the srandom() function call to set the initial seed of the pseudo-random
number generator. By setting this to a specific value, we can repeat the simulation with an identical
set of events. By default, the pseudo-random number generator is initialized with the process ID of
the generator process, making the output different on every run. The -size
option will dramatically expand the size of the system; it is used in Section 6.3.11.

Table 6.3 shows the options to specify the number of components and their average lifetime.
Note that the average lifetimes are for individual components; for example, with 512 disks,
there will be approximately five to six failures per year (512/90 ≈ 5.7) if each disk has a 90-year
average lifetime with exponential distribution. It is also possible to specify the number of each
component. Numbers of some components are related; for instance, the number of Ethernet adapters
is equal to the number of PCs, and the number of SCSI cables is twice the number of disk enclosures
because of double-ending. The related components are denoted by “xn” in the table, and their
numbers can be changed together by specifying only one option. I rounded up the default number of
components to the closest power of 2 from those of our system to make it easier to test different
configurations.
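
The relationship between per-component lifetime and system-wide failure rate can be checked with a short calculation. This is a sketch under the stated assumption of independent, exponentially distributed lifetimes; the function name is made up for the example.

```python
def expected_failures_per_year(count, mean_lifetime_years):
    """Aggregate failure rate of `count` independent components, each with an
    exponentially distributed lifetime of the given mean (in years)."""
    return count / mean_lifetime_years

# Defaults quoted in the text for disks and PCs.
print(round(expected_failures_per_year(512, 90), 2))  # 5.69
print(round(expected_failures_per_year(32, 7), 2))    # 4.57
```

Even with a 90-year average lifetime per disk, a 512-disk system therefore sees a disk failure roughly every two months.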

As mentioned above, most of the default failure rates are taken from data described in
Chapter 5. The exception to this is the Ethernet switch, which had one failure out of two components,
or a 50% failure rate in three years. It is obvious that the sample size is too small to draw any
meaningful conclusions about them. I assigned a default failure rate of zero to Ethernet switches,
along with UPS units, which did not fail during our project’s duration, as I did not want to assign an
arbitrary number for something we did not know. The effects of these parameters, as well as others,
are explained in Section 6.3. The failure rates of Ethernet switches and UPS units did not have much
effect on the overall performance of the system.

Option   Description                                 Default
-d       Duration of simulation                      100,000 years
-s       Random seed                                 process ID
-ri      Repair interval (visit frequency)           1 week
-rt      Reboot time                                 60 seconds
-size    Multiply size of entire system by integer   1

Table 6.2: Generator general options

                           Number               Average lifetime
Component                  Option     Default   Option   Default
Disk                       -nd        512       -da      90 years
PC                         -npc       32        -pa      7 years
Operating System           -npc       32        -osa     ∞
UPS                        -nups      16        -ua      ∞
SCSI adapters              -nenc x2   64        -saa     90 years
SCSI cables                -nenc x2   64        -sca     40 years
Disk enclosures (SCSI)     -nenc      32        -esa     30 years
Disk enclosures (power)    -nenc x2   64        -epa     60 years
Ethernet adapters          -npc       32        -eaa     40 years
Ethernet cables            -npc       32        -eca     16 years
Ethernet switches          -nes       2         -ewa     ∞

Table 6.3: Generator component options

B/R  name          number  time
B    etheradapter  11      2546387
R    etheradapter  11      3024000
B    PC            0       4345488
R    PC            0       4838400
B    ethercable    16      5459964
R    ethercable    16      6048000
B    disk          153     9387963
B    encpower      20      9436074
R    disk          153     9676800
R    encpower      20      9676800
B    encscsi       26      10826452
R    encscsi       26      10886400

Figure 6.3: Sample generator output

We did not observe any operating system crashes, but by specifying the -osa option, it
is possible to simulate randomly crashing operating systems. The -rt option in Table 6.2 can be
used to specify the amount of time it takes for the operating system to recover from a crash with a
reboot.

All failed components except the operating system are assumed to be repaired when the
scheduled operator visit occurs; this is not an unreasonable assumption since all that is needed
is enough spare components. The repair interval is specified by the -ri option.

Generator  Output 

Figure 6.3 shows some lines from a sample run of the generator. The fields are, from left
to right,

break/repair component-name component-number time

The first field is “B” when a component breaks and “R” when it is repaired. The next two fields
are the type of the component and an integer identification number. The last field is the number of
seconds since the simulation started. For instance, the first line of Figure 6.3 means “Ethernet
adapter card number 11 broke 2,546,387 seconds after the simulation was started.”
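
The generator's behavior as described above can be approximated with the following sketch. It is not the original program: the function name, argument layout and the use of Python's `random` module (instead of srandom() seeded with the process ID) are assumptions. It draws exponentially distributed times to failure and schedules each repair at the next operator visit, which matches the sample output, where every “R” time is a multiple of one week (604,800 seconds). Components with an infinite lifetime are simply left out of the input.

```python
import random

WEEK = 7 * 24 * 3600    # default repair interval (-ri), in seconds
YEAR = 365 * 24 * 3600

def generate(components, duration, repair_interval=WEEK, seed=None):
    """Return (kind, name, number, time) events ordered by time.

    components maps a component name to (count, mean lifetime in seconds).
    """
    rng = random.Random(seed)   # a fixed seed repeats the same events
    events = []
    for name, (count, mean_life) in components.items():
        for number in range(count):
            t = 0
            while True:
                # Exponentially distributed time until the next break.
                t = int(t + rng.expovariate(1.0 / mean_life))
                if t >= duration:
                    break
                events.append(("B", name, number, t))
                # The component is repaired at the next scheduled visit.
                t = (t // repair_interval + 1) * repair_interval
                if t >= duration:
                    break
                events.append(("R", name, number, t))
    # Stable sort: events of one component keep their order on ties.
    events.sort(key=lambda e: e[3])
    return events

sample = generate({"disk": (512, 90 * YEAR), "PC": (32, 7 * YEAR)},
                  duration=2 * YEAR, seed=1)
for kind, name, number, t in sample:
    print(kind, name, number, t)
```

Generating each component's timeline independently and merging afterwards keeps the events uncorrelated, as the text requires.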

6.1.2  The  Simulator 

The other program, the simulator, reads the output generated by the generator and determines
the system’s behavior. The simulator keeps track of the failures and notes when some
of the data becomes unreachable. At the end, it will print out a summary of the system’s availability.

Figure 6.4: OR-node

Simulator  Architecture 

The simulator treats the system as a DAG. A node of the graph is either a hardware component,
such as a disk, or an abstract concept such as “data” or “users”. Some hardware components,
such as power supplies, have more than one corresponding node in order to model the behavior
better. An arc generally denotes a connection where data can flow from one node to another. Some
arcs convey other special information, such as “electric power”. As mentioned above, the system is
functional if all the “data” are connected to the “users”.

Nodes  There are two types of nodes in the system. Most of the nodes are data nodes, which denote
places through which data can flow. There is also a special type of node for uninterruptible power
supplies, called power nodes. Power nodes don’t handle data; they supply power to data nodes.

Data Nodes  Data nodes carry data. There are two types of data nodes, OR-nodes and AND-nodes,
depending on how failures of a subset of their descendants are handled.

OR-Nodes  The double-ended SCSI cable in Figure 6.4 is an OR-node. From an OR-node,
the data can flow in either direction. If the data eventually gets to the destination from
any of the direct descendants, the OR-node is said to be connected to the destination. I
use the “plus-in-circle” symbol to denote an OR-node in the figures.

AND-Nodes  The striped virtual disk in Figure 6.5 is an AND-node. An AND-node is connected
to the destination only if all of the immediate descendants of the node are connected
to the destination node. In particular, if any immediate descendant of an
AND-node fails, the AND-node itself will be disconnected. I use the “times-in-circle”
symbol to denote an AND-node in the figures.

Most data nodes of the Tertiary Disk system graph are OR-nodes. The only components that
are pure AND-nodes in our configuration are data and striped sets. Note that if a node has only
one descendant, it can be called either an OR- or AND-node and have the exact same behavior.
In such cases, I have omitted the circle symbols.

[Figure 6.5 (diagram label: Individual Disks)]

Power  Nodes  Power  nodes  represent  uninterruptible  power  supplies.  They  also  have  arcs  that 
connect  them  to  other  components  but  these  arcs  don’t  represent  flow  of  data — they  are  power 
cables.  If  a  power  node  goes  down,  it  will  take  down  all  the  components  connected  to  it. 
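
The node semantics described above can be sketched as follows. This is a minimal illustration rather than the simulator's implementation: the class names, the `connected` method and the small example graph (two striped disks on a double-ended SCSI cable between two PCs, with one UPS feeding PC A) are all assumptions made for the example.

```python
class PowerNode:
    """A UPS unit: it carries no data but powers data nodes."""
    def __init__(self, name):
        self.name = name
        self.broken = False

class DataNode:
    """A data node; kind is "OR" (any descendant suffices) or "AND" (all needed)."""
    def __init__(self, name, kind="OR", children=(), power=()):
        self.name = name
        self.kind = kind
        self.children = list(children)  # immediate descendants
        self.power = list(power)        # power nodes feeding this node
        self.broken = False

    def connected(self, destination):
        """True if data can flow from this node to the destination node."""
        if self.broken or any(p.broken for p in self.power):
            return False
        if self is destination:
            return True
        if not self.children:
            return False
        reach = [c.connected(destination) for c in self.children]
        return all(reach) if self.kind == "AND" else any(reach)

def build_example():
    """Two striped disks on a double-ended SCSI cable between two PCs."""
    ups = PowerNode("UPS")
    users = DataNode("users")
    pc_a = DataNode("PC A", children=[users], power=[ups])
    pc_b = DataNode("PC B", children=[users])
    cable = DataNode("SCSI cable", "OR", [pc_a, pc_b])
    disks = [DataNode(f"disk {i}", children=[cable]) for i in range(2)]
    stripe = DataNode("striped set", "AND", disks)
    return {"ups": ups, "users": users, "pc_a": pc_a, "pc_b": pc_b,
            "disks": disks, "stripe": stripe}

g = build_example()
g["ups"].broken = True                     # takes PC A down with it
print(g["stripe"].connected(g["users"]))   # data still reaches users via PC B
g["disks"][1].broken = True                # the AND-node loses one descendant
print(g["stripe"].connected(g["users"]))   # the striped set is disconnected
```

The recursion revisits shared descendants, which is acceptable for a small example; a full-size graph would benefit from caching each node's result per evaluation.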

Simulator  Details 

As mentioned in the previous section, the simulator builds a DAG and analyzes its connectivity
to determine the system’s functionality. However, the DAG does not simply reflect the
connectivity of the hardware of the system. Some of the nodes, such as data and users, are abstract
concepts that don’t have corresponding hardware; some others, such as redundant power supplies,
require special handling to model their behavior with a simple DAG. This section explains how
some parts of the system are represented in our simulator.

Data  The data is divided into datasets and spread around the entire cluster. By default, datasets
are mirrored and reside on striped sets. If any of the datasets is unavailable, the
system is said to have suffered an outage.

To model this, I use a single AND-node called “data” that has all the datasets as immediate
descendants. The connectivity from the datasets depends on the configuration of the system.
Figure 6.6 shows how the data is connected to the striped sets in the default configuration
using mirroring. Note that the data is an AND-node while the datasets are OR-nodes.

Striped  Sets  As  shown  in  Figure  6.5,  a  striped  set  of  multiple  disks  can  be  represented  as  an  AND- 
node  with  all  the  disks  in  the  set  as  its  descendants.  A  concatenation  set,  where  the  individual 
disks  are  serially  concatenated  instead  of  having  data  striped  across  them,  is  similar — the 
only  difference  with  a  striped  set  is  the  actual  data  layout  on  the  physical  disks,  which  has  no 
bearing  upon  the  connectivity  of  the  system. 




Figure  6.6:  Data,  datasets  and  striped  sets 


It is worth pointing out that, as an extension, a parity-based RAID-5 array can be represented
as a variation of an AND-node where all but one of the descendants have to be connected for
the node to be functional. Another way to look at this is to count the number of descendants
that have to be connected for a certain node to be connected. Let’s call this number the
connectivity factor. For a node with N descendants, an AND-node has a connectivity factor
of N, an OR-node has 1 and a RAID-5 node has N - 1. By making this an arbitrary number,
we can simulate other designs such as P+Q RAID, which has a connectivity factor of N - 2.
See Table 6.4 for a summary.


Name               Connectivity factor   Comment
Mirror             1                     OR-node
Striped set        N                     AND-node
Concatenated set   N                     AND-node
RAID-5             N - 1
P+Q RAID           N - 2

Table 6.4: Connectivity factors
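
The connectivity factor turns the OR/AND distinction into a single threshold test. The sketch below illustrates this with the values from Table 6.4; the `FACTORS` table and the `is_connected` helper are names invented for the example, not part of the simulator.

```python
# Connectivity factor for a node with n immediate descendants (Table 6.4).
FACTORS = {
    "mirror": lambda n: 1,         # OR-node
    "striped": lambda n: n,        # AND-node
    "concatenated": lambda n: n,   # AND-node
    "raid5": lambda n: n - 1,
    "p+q": lambda n: n - 2,
}

def is_connected(layout, child_states):
    """child_states holds one boolean per immediate descendant (True = connected).
    The node is connected if enough descendants are connected."""
    states = list(child_states)
    return sum(states) >= FACTORS[layout](len(states))

one_down = [True, True, False, True, True]
for layout in FACTORS:
    print(layout, is_connected(layout, one_down))
```

With one of five descendants disconnected, the striped and concatenated sets fail while the mirror, RAID-5 and P+Q nodes stay connected.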


Power  Failures  A  power  supply  is  represented  by  adding  an  extra  node  behind  an  existing  one  as  in 
Figure  6.7.  This  extra  node  is  added  even  if  we  don’t  simulate  failures  for  this  particular  type 
of  power  supply  to  make  the  graph  easy  to  construct  and  understand.  Power  supply  nodes  are 
connected  to  the  UPS  (power)  nodes. 

Figure 6.7: Power supplies

Redundant Power Supplies  Redundant power supplies such as those found in our SCSI enclosures
can be emulated by dividing the nodes into sub-nodes as shown in Figure 6.8. The
enclosure itself is duplicated in two places: the in-node (frontend) and the out-node (backend).
Data flows into the in-node and out of the out-node. Between the in-node and the
out-node are the power supplies. The dotted line that surrounds the in/out-nodes and power
supplies is the enclosure as seen from the outside.


Figure 6.8: Redundant power supplies


The “real” enclosure can be thought of as being either the in-node or the out-node, but it is
probably easiest to consider the entire contents of the dotted lines to be the enclosure. The
important characteristic of this combined node is that even if one of the power supplies or
UPS units fails, data still flows in and out of the enclosure, while if both power supplies lose
power, the enclosure will be dysfunctional.

In  my  simulator,  an  enclosure  failure  is  treated  as  an  in-node  failure. 
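
The behavior of the combined enclosure node can be summarized with a small predicate. This is an illustrative simplification, not the simulator's graph representation; the function name and argument layout are assumptions.

```python
def enclosure_functional(in_node_ok, supplies):
    """supplies holds one (supply_ok, ups_ok) pair per power supply.

    Data passes through the enclosure if the in-node works and at least one
    power supply is working and still fed by a working UPS.
    """
    powered = any(supply_ok and ups_ok for supply_ok, ups_ok in supplies)
    return in_node_ok and powered

print(enclosure_functional(True, [(False, True), (True, True)]))   # one supply died
print(enclosure_functional(True, [(False, True), (True, False)]))  # no supply has power
```

In the graph itself, the same effect comes from placing the two power-supply nodes as alternative paths between the in-node and the out-node.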




Simulator  Parameters 

By default, the simulator will construct a system with 32 PCs and 512 disks in a double-ended
configuration. For simplicity, 16 disks are put in one enclosure, although a real wide-SCSI
bus can only address 16 devices, and thus cannot support 16 disks in addition to the two SCSI
adapters required for double-ending on one bus. Double-ending and mirroring can be turned off by
command-line options. Note that the -mirror option actually specifies the number of replicated
copies of data: -mirror 1 will turn mirroring off; -mirror 4 will specify a 4-way mirror.

Table 6.5 shows all the parameters to the simulator. By default, the simulator runs until
the input from the generator is exhausted, but it can be stopped before the end of input by specifying
the -d option. Verbose descriptions of each outage can be generated with the -v option. It is also
possible to specify the number of each individual component. As was the case with the generator,
some options change the number of more than one component; for instance, if you specify -nenc
16 with double-ending, there will be 16 enclosures, 32 enclosure power supplies, 32 SCSI cables
and 32 SCSI adapters. The -size option will make the entire system larger by simply multiplying
the number of all components.


Option    Description                      Default
-d        Duration of simulation           until end of input
-nde      Turn off double-ending           false (enable double-ending)
-v        Turn on verbose output           false
-mirror   Mirroring factor                 2
-nd       Number of SCSI disks             512
-nccd     Number of disk stripe sets       32
-nenc     Number of disk enclosures        32
-npc      Number of PCs                    32
-nups     Number of UPS units              16
-size     Multiply size of entire system   1

Table 6.5: Simulator parameters


Simulator  Output 

The  simulator,  without  the  -v  option,  outputs  just  one  line  of  summary  information  such 
as  the  following: 

185 breaks (9678 h) ave 52 h (0.97 h/y), avail: 99.98895%, MTBF 54.05 y (473514 h)

This  line  means  the  simulator  detected  185  outages  totaling  9,678  hours  over  10,000  years  for  an 
average  of  52  hours  per  outage  and  0.97  hours  of  outage  per  year.  The  availability  of  the  system 
was  99.98895%,  and  the  mean  time  between  failures  (MTBF)  was  54.05  years,  or  473,514  hours. 
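
The figures in the sample line are mutually consistent if a year is taken as 8,760 hours and the MTBF is the simulated duration divided by the number of outages. The sketch below recomputes them under those assumptions; the formulas are inferred from the numbers above rather than taken from the simulator's source, and the function name is made up.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours; assumed year length

def summarize(outages, outage_hours, years):
    """Recompute the summary statistics from the raw outage totals."""
    total_hours = years * HOURS_PER_YEAR
    return {
        "avg_outage_h": outage_hours / outages,
        "outage_h_per_year": outage_hours / years,
        "availability_pct": 100.0 * (1 - outage_hours / total_hours),
        "mtbf_years": years / outages,
        "mtbf_hours": total_hours / outages,
    }

s = summarize(outages=185, outage_hours=9678, years=10_000)
print(f"{s['avg_outage_h']:.0f} h per outage, {s['outage_h_per_year']:.2f} h/y")
print(f"avail: {s['availability_pct']:.5f}%")
print(f"MTBF {s['mtbf_years']:.2f} y ({s['mtbf_hours']:.0f} h)")
```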

The most interesting numbers here are the MTBF and outage hours per year. One goal
of our system design was to create a large-scale storage system with an MTBF figure comparable
to that of a single disk. With manufacturers’ quoted MTBF figures for high-end SCSI disks being a few
hundred thousand to a million hours (about 30 to 110 years), a six- or seven-digit number of hours
here means our system has attained the goal.

On the other hand, the outage hours per year is more intuitive and is easier to understand
and compare. We cannot compare this number with disks (once disks fail, they generally stay
failed, so disk drive manufacturers don’t quote outage hours per year for disks), but we can compare
it with high-availability systems. For instance, Jim Gray mentions in a paper that, in 1989,
Tandem systems had a system MTBF of 21 years and an MTTR of 10 hours, which translates to about
0.5 hours of outage per year [Gra90]. Also, some manufacturers claim 99.999% availability for
their server systems. That translates to about 0.09 hours, or 5 minutes, of outage per year.
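
The two conversions used above are simple enough to write down. The sketch assumes a year of 8,760 hours and approximates outage per year as one MTTR-long outage per MTBF period; the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365   # assumed year length

def outage_from_mtbf(mtbf_years, mttr_hours):
    """Approximate outage hours per year: one MTTR-long outage every MTBF."""
    return mttr_hours / mtbf_years

def outage_from_availability(availability_pct):
    """Outage hours per year implied by an availability percentage."""
    return (1 - availability_pct / 100.0) * HOURS_PER_YEAR

print(round(outage_from_mtbf(21, 10), 2))            # 0.48 hours (Tandem, 1989)
print(round(outage_from_availability(99.999), 4))    # 0.0876 hours, about 5 minutes
```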

6.2  Alternatives  to  Simulation 

Although the highly redundant design makes our system hard to approach analytically,
there are alternatives to event-driven simulation. One of them is to use a Bayesian network.
Although Bayesian networks are primarily intended for static probability analysis, it is possible to
employ them in a time-based problem like ours. However, for a system with a few hundred nodes,
the computational complexity of estimating the availability using Bayesian networks is very high.

There are some additional problems with using Bayesian networks to evaluate our system.
One is that, because the operator visits to repair failed components, we can no longer calculate the
probability of two components failing at the same time as a product of the probability of each
component failing; the failure intervals of components are no longer independent. Also, with an
event-driven simulator, it is trivial to explore additional repair policies (for instance, the operator
visits only when there are two or more failed components), but such additional dependencies will
make the transition matrix of a Bayesian network much more complicated. Therefore, we decided
to use an event-driven simulator to evaluate the system.
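
As a rough illustration of how naturally simulation handles a scheduled repair policy, the following Python sketch estimates the downtime of a single mirrored pair whose failed members are replaced only at fixed operator visits. Everything in it is an assumption made for illustration: exponentially distributed lifetimes, instantaneous repair at each visit, the function names, and the parameter values. It is far simpler than the simulator described in this chapter, and it samples one repair interval at a time rather than processing an event queue.

```python
import math
import random


def simulated_down_fraction(mean_life, interval, n_intervals, seed=0):
    """Monte Carlo estimate of the fraction of time both mirror copies are down.

    Both copies are working after every operator visit (repairs are assumed
    instantaneous), so each repair interval can be sampled independently.
    """
    rng = random.Random(seed)
    down = 0.0
    for _ in range(n_intervals):
        first = rng.expovariate(1.0 / mean_life)   # failure time of copy A
        second = rng.expovariate(1.0 / mean_life)  # failure time of copy B
        both_failed_at = max(first, second)
        if both_failed_at < interval:
            # The data stays unavailable until the operator's next visit.
            down += interval - both_failed_at
    return down / (n_intervals * interval)


def analytic_down_fraction(mean_life, interval):
    """Closed form for the same model: (1/T) * integral_0^T (1 - e^(-t/M))^2 dt."""
    m, t = mean_life, interval
    integral = (t - 2.0 * m * (1.0 - math.exp(-t / m))
                + 0.5 * m * (1.0 - math.exp(-2.0 * t / m)))
    return integral / t


for days in (7, 28):
    sim = simulated_down_fraction(mean_life=100.0, interval=days, n_intervals=200_000)
    ana = analytic_down_fraction(mean_life=100.0, interval=days)
    print(f"visit every {days:2d} days: simulated {sim:.5f}, analytic {ana:.5f}")
```

For one isolated pair a closed form exists and serves as a check on the sampling code. With hundreds of components that share adapters, buses, and power, no such shortcut is available, whereas changing the repair policy in a simulation only changes a few lines.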

6.3  Simulation  Results  and  Exploration 

The simulation was run under several conditions. First, the behavior of the base system,
with hardware configuration and failure parameters similar to the system we have, is presented and
analyzed. Subsequently, variations on the system are introduced.

The simulator was used to answer several questions to further our understanding of our
design and choice of architecture. Here are the questions with their corresponding section numbers.

6.3.1 What are the availability characteristics of our default system?

6.3.2 What are the effects of individual component failure rates?

6.3.3 How much time does the operator actually spend conducting repairs? Is there some way to
schedule repairs more efficiently?

6.3.4 How much will the repair interval affect reliability?

6.3.5 What is the effect of disk striping?

6.3.6 How valuable is mirroring?


6.3.7  How  about  RAID-5? 

6.3.8  How  valuable  is  double-ending  of  SCSI  disks? 

6.3.9 Will operating system crashes affect availability? How much will reboot time after a crash
affect availability?

6.3.10 What is the effect of decreasing or increasing the number of disks per PC?

6.3.11 How will a system 10 times as big as the current one behave? How can I make it as reliable
as the current one?

6.3.12 What will it take to build a system with 99.999% availability given the current component
failure rates? What about 99.9999%?

The following sections will explore these questions in detail.

6.3.1  Overview 

“What are the availability characteristics of our default system?”

The default system had an MTBF of 630,000 hours (70 years) and an availability of
99.990%, or 0.8 hours of downtime per year. The average downtime per failure was 57 hours.
In other words, on average, the system will go down once every seven decades, and when it’s down,
it takes about two and a half days on average to recover. Note that the operator comes in to repair
any failed component every seven days; the time to recover is strongly affected by
the repair interval, as can be seen below.

6.3.2  Failure  Rates 

“What are the effects of individual component failure rates?”

The failure rates of the individual components have differing effects on the behavior of the entire
system. Let’s take a look at how they affect the overall system reliability.

SCSI  Drives 

By far the largest portion of failures comes from SCSI drives. Figures 6.9 and 6.10 show the
relationship between SCSI drive failure rates and availability. As mentioned above, with the
disk failure rate we observed, one failure every 90 years, the outage per year is just under 1
hour, and the MTBF is 70 years. Manufacturers often quote 1 million hours MTBF for their
high-end SCSI drives; that translates to about 114 years, so our whole system MTBF is close
to that.

As can be seen from the graphs, if the disks become more reliable in the future, the MTBF of
the system could go up to over 1,000 years and outage down to less than 0.1 hours per year
before other factors start dominating the characteristics of the system. On the other hand, with
unreliable disks, the system reliability could go down very quickly. For instance, if the disks
have only one tenth of the average lifetime that the manufacturers claim, the outages will be
44 hours per year (99.5% availability) and the MTBF will be about 1.3 years.


IDE  Drives 

Figure 6.11 shows the relationship between the PC’s internal IDE drive failure rates and the
outages. The disk failure rate we observed was one failure every 7 years. Manufacturers usually
quote about 300 thousand hours as IDE drive MTBF, which is about 34 years. Since we
are already in a fairly flat part of the graph, our higher failure rate didn’t affect the availability
much. However, without double-ending, the IDE drive failures have a profound impact on
system availability; see Section 6.3.8 for details.


Figure 6.11: IDE drive failure rates
(Dotted  line  indicates  observed  value) 


UPS 

Figure 6.12 shows the relationship between the UPS failure rates and availability. We didn’t
observe any UPS failures, so the range of failure rates is arbitrary. It is apparent from
the graph that UPS failures do not have a large impact on our system, as long as the MTBF is
greater than 1 year.

Ethernet  Switches 

Figure 6.13 shows the relationship between the Ethernet switch failure rates and availability.
We did observe one failure out of two switches in three years, for an average component
MTBF of 6 years. It is apparent from the graph that Ethernet switch failures do not have a
large impact on our system, which is a testimony to our double-ended configuration.

Others 

As can be seen in Figure 6.14, the failure rate of SCSI adapters seems to have very little effect
on overall MTBF. Our observed rate of once per 90 years is already in a very flat part of
the curve. The same can be said for all other components, namely SCSI cables, disk enclosure
power supplies, disk enclosure SCSI bus integrity, Ethernet adapters and Ethernet cables.


Figure 6.12: UPS failure rates


Figure 6.13: Ethernet switch failure rates


Figure 6.14: SCSI adapter failure rates
(Dotted  line  indicates  observed  value) 


Figure 6.15 shows the effect of SCSI drives, IDE drives and SCSI adapters. Since each
component has a different observed lifetime, I normalized the horizontal axis to align their average
lifetimes with our observed values. Thus, the dotted line marked “1” is our observation. It can be
seen from this graph that any difference that the reliability of IDE drives or SCSI adapters makes is
insignificant compared to that of SCSI disks.

To  summarize  the  effect  of  component  failure  rates  on  system  reliability: 

• SCSI disks have by far the largest effect on system reliability, with an almost linear
relationship.

•  We  couldn’t  get  enough  data  on  UPS  units  and  Ethernet  switches,  but  judging  from  the  shapes 
of  curves,  they  don’t  seem  to  matter  much. 

•  No  other  component  has  a  significant  effect  on  the  system  reliability  either. 

6.3.3  Operator  Cost 

“How much time does the operator actually spend conducting repairs? Is there some way to
schedule repairs more efficiently?”

Figure 6.15: Effect of component failure rates


Figure 6.16 shows how long it takes for an operator to do the job. The two lines are
minutes per visit that the operator spends conducting repairs. One of them is divided by the total
number of repair intervals; the other one is divided by the number of times that the operator actually
has something to repair. As can be seen from the upper line, the latter number levels off at about half
an hour per visit.

Figure  6.17  shows  the  ratio  of  repair  intervals  during  which  a  failure  has  occurred.  With 
a  one-week  repair  interval,  the  operator  has  to  conduct  a  repair  only  28%  of  the  time.  The  time 
between  failures  on  our  system  is  about  21  days.  In  other  words,  a  repair  is  necessary  once  every 
three  weeks,  and  as  the  repair  interval  decreases,  the  ratio  of  visits  that  are  actually  necessary  also 
decreases  quickly. 
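
The 28% figure is close to what a simple Poisson model predicts. The sketch below assumes, purely as an illustration, that failures arrive independently at a constant rate of one per 21 days; the function name is hypothetical and the calculation is not taken from the simulator.

```python
import math


def fraction_of_intervals_with_failure(interval_days: float,
                                       days_between_failures: float = 21.0) -> float:
    """P(at least one failure during a repair interval) for Poisson arrivals."""
    return 1.0 - math.exp(-interval_days / days_between_failures)


for label, days in [("18 hours", 0.75), ("1 day", 1.0), ("1 week", 7.0), ("1 month", 30.0)]:
    share = fraction_of_intervals_with_failure(days)
    print(f"{label:>8}: {share:.1%} of intervals contain a failure")
```

Under this assumption a one-week interval gives about 28%, in line with the simulated value quoted above.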

See  the  next  section  for  discussion  on  how  to  efficiently  schedule  repairs  to  reduce  the 
cost  while  maintaining  high  availability. 

6.3.4  Repair  Interval 

“How  much  will  the  repair  interval  affect  reliability?” 

Figures 6.18 and 6.19 show the effect of varying the repair interval. As mentioned before,
with the standard 1-week interval between operator visits, the MTBF is about 70 years. With more
frequent visits, we can increase this to a much greater number and reduce the downtime. For instance,
with a once-per-day visit, the downtime will be 0.03 hours, or 2 minutes, per year, and
the MTBF will be 260 years; availability is 99.9997%. On the other hand, with a once-per-month
visit, the outage per year will rise to 15 hours. That is more than a fifteen-fold increase compared
to systems with weekly visits. Also, the MTBF will go down to 15 years. The MTBF might still be
acceptable in some cases; however, the outages, averaging just under 10 days, could last as long as
a full month.
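
The average outage length follows from the other two numbers: it is the yearly outage divided by the number of outages per year. A minimal sketch, assuming the MTBF can be read as the mean time between outages (the outage itself is short by comparison) and using a hypothetical function name:

```python
def mean_outage_hours(outage_hours_per_year: float, mtbf_years: float) -> float:
    """Average length of one outage: yearly outage divided by outages per year."""
    outages_per_year = 1.0 / mtbf_years
    return outage_hours_per_year / outages_per_year


for label, outage, mtbf in [("weekly visits", 0.8, 70), ("monthly visits", 15, 15)]:
    hours = mean_outage_hours(outage, mtbf)
    print(f"{label}: about {hours:.0f} hours per outage ({hours / 24:.1f} days)")
```

With monthly visits this gives 225 hours, or a little over 9 days, which matches the "just under 10 days" above; with weekly visits it gives about 56 hours, consistent with the 57 hours reported in Section 6.3.1.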


Figure 6.16: Operator time
(Repair time per visit in minutes, per interval and per interval that requires repair, against repair interval)


Figure 6.17: Visits that result in repair
(Ratio of visits involving repairs against repair interval)


Figure  6.18:  Repair  interval  and  availability 
(Dotted  line  indicates  default) 


Figure  6.19:  Repair  interval  and  MTBF 
(Dotted  line  indicates  default) 


With a once-per-year visit, the system will have an MTBF of less than 2 years. That means
the chance that the system is operational when the operator shows up is less than 50%. Also, the
outage per year is close to 2,000 hours, or almost 3 months.

According to these numbers, by picking one week at the beginning of the project, it appears
we made an extremely lucky guess at a repair interval for our system
that keeps administration costs down while providing excellent availability. If we wanted the highest
availability, we would hire people to maintain the system 24 hours a day. Such vigilance would
probably cut MTTR to 2 hours, yielding an availability of 99.999999%. When computer manufacturers
claim 99.999% availability, they usually assume constant attention to the system.

One interesting observation is that, combined with the results from the preceding section,
it should be possible to effectively reduce the repair interval without actually having the operator visit
the site every time. For instance, according to Figure 6.18, an 18-hour repair interval will increase
the availability of the system to 99.9999%. However, Figure 6.17 shows that repairs are necessary
only 3.5% of the time with the 18-hour interval. Note that the “18-hour repair interval” here means
the operator has to come in within 18 hours of the failure, and not necessarily every 18 hours if there
is nothing to fix.

With this in mind, consider the following arrangement with the operator that we call the Pager
Arrangement, as opposed to the traditional Fixed Interval method:

• The operator carries a pager.

• If the system pages the operator between 7 AM and 3 PM, the operator should come in and
conduct the repair before 5 PM on the same day.

• If the system pages the operator between 3 PM and 7 AM the next morning, the operator
should come in and conduct the repair at 9 AM.

Under the Pager Arrangement, the operator will only have to come in once every 21 days,
or three weeks, to conduct necessary repairs. Figure 6.16 also shows that the time it takes for the
operator to do the job is roughly constant at half an hour for intervals less than a week.

Under this kind of arrangement, the availability will be slightly better than the Fixed
Interval 18-hour repair interval, since this is in effect a combination of 8-hour and 18-hour repair
intervals. See Figure 6.20 for a comparison between the Fixed Interval and Pager Arrangements. In
the Fixed Interval Arrangement, any failure that happened during an 18-hour interval will be fixed
at the end of that interval; with the Pager Arrangement, any failure in the 8-hour period between
7 AM and 3 PM gets fixed at 5 PM with a maximum delay of 10 hours, and any failure in the 16-hour
period between 3 PM and 7 AM will be repaired at 9 AM with a maximum delay of 18 hours.
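
The rules of the Pager Arrangement can be written down as a small scheduling function. The Python sketch below is only an illustration: the function name and the sample dates are hypothetical, and a page at exactly 3 PM is assumed to fall into the overnight window.

```python
from datetime import datetime, time, timedelta


def pager_repair_time(failure: datetime) -> datetime:
    """Latest repair time under the Pager Arrangement for a given failure time."""
    day = failure.date()
    if time(7, 0) <= failure.time() < time(15, 0):
        # Paged during the day: repair by 5 PM the same day.
        return datetime.combine(day, time(17, 0))
    if failure.time() >= time(15, 0):
        # Paged in the afternoon or evening: repair at 9 AM the next morning.
        return datetime.combine(day + timedelta(days=1), time(9, 0))
    # Paged after midnight but before 7 AM: repair at 9 AM the same morning.
    return datetime.combine(day, time(9, 0))


for stamp in ("2000-05-01 07:00", "2000-05-01 14:59", "2000-05-01 15:00", "2000-05-02 02:30"):
    failed = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    delay = pager_repair_time(failed) - failed
    print(f"failure at {stamp} -> repair delay {delay}")
```

A failure at 7 AM waits the full 10 hours and a failure at 3 PM waits the full 18 hours; every other failure time is repaired sooner.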

As a variation of the Pager Arrangement, the operator can remotely check the status of the
system at 3 PM and 7 AM, and come in at 5 PM and 9 AM, respectively, if there is a need for repair.
In this case, the operator needs a network connection instead of a pager. The availability of this
system will be identical to that of one with the Pager Arrangement.

6.3.5  Disk  Striping 

“What is the effect of disk striping?”

As mentioned in Section 3.2.2, we use disk striping to combine multiple disks on one PC
into one large virtual disk. This merging is a tradeoff between operator cost and reliability: it is
easier for the operator to handle data layout issues, especially filesystem full conditions, when there
are fewer and larger “disks”. It is also easier to tune the performance of the system, as striping will
inherently balance the load evenly among the disks in the stripe set. On the other hand, reliability
will suffer because a single disk failure makes the whole dataset vulnerable.
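
The reliability side of this tradeoff can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent disks with exponentially distributed lifetimes, in which case a stripe set of n disks loses a member n times as often as a single disk. The 90-year figure is the observed SCSI disk failure interval quoted in Section 6.3.2; the function name is hypothetical.

```python
def stripe_set_years_between_failures(disk_years_between_failures: float,
                                      disks_in_set: int) -> float:
    """Mean time until some disk in the stripe set fails (independent disks)."""
    if disks_in_set < 1:
        raise ValueError("a stripe set needs at least one disk")
    return disk_years_between_failures / disks_in_set


for size in (1, 4, 16, 32):
    years = stripe_set_years_between_failures(90.0, size)
    print(f"{size:2d}-disk stripe set: one member failure every {years:.1f} years")
```

Because the data is also mirrored, a member failure leaves the stripe set vulnerable rather than immediately unavailable, so this number bounds how often a set is exposed, not how often the system is down.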


Figure 6.21: Stripe set sizes
(Horizontal axis: number of disks per striped set. Dotted line indicates default)


Figure 6.21 shows how the number of disks in a striped set affects availability. Our default
is 16, meaning the 32 disks on a double-ended pair are combined into two striped sets, one per PC.
With our current software, the actual maximum is 32, since we do not have a network filesystem
or some other network striping software. Therefore, the figures for size 64 and larger are purely
hypothetical. On the other end of the graph, a size 1 stripe set is actually a single disk.

As can be seen from the figure, as the striped set size increases, the availability gradually
goes down. However, even at the maximum, when the entire collection of disks is divided into only
two stripe sets (for mirroring), the outage per year is only a little over 10 hours. With no striping, the
outage can be reduced to about one third of our default configuration, but with a significant penalty
in operator labor.

6.3.6  Mirroring 

“How valuable is mirroring?”

Figure 6.22 shows how mirroring will affect availability. I changed the SCSI disk failure
rates and plotted the outage per year for no mirroring, 2-way mirroring, 3-way mirroring and 4-way
mirroring systems. If a particular plot ends in the middle of the graph, it means the outage has been
less than 0.01 hours per year to the right of the last point.


Figure 6.22: Disk failure rates and mirroring
(Horizontal axis: average time between SCSI disk failures, in years. Dotted line indicates observed value)


Without mirroring of data, the MTBF goes down to 1,300 hours (0.16 years) and availability
to 93.9% (540 hours of downtime per year) with an average downtime of 85 hours. The
dominating factor is disk failures, as any disk going down will cause a failure.

With extremely reliable disks, the outage can be reduced to about 100 hours per year, but
this is still unacceptably high. With less reliable disks, the numbers are positively dismal without
mirroring. On the other hand, with 3-way or 4-way mirrors, the reliability goes up significantly.

6.3.7  RAID-5 

“How about RAID-5?”

Can we substitute mirroring between machines with a parity-based protection scheme
within the same machine and still have the same kind of availability? The answer is no.

According to Figure 6.22, with no mirroring, the availability can only improve to about
99% even with infinitely reliable disks. A parity-based protection scheme within an enclosure or a
SCSI adapter will be limited by this number as an upper bound. In other words, a system without
inter-machine data protection simply cannot be made reliable enough because of other single points
of failure such as SCSI buses.

6.3.8  Double-Ending 

“How  valuable  is  double-ending  of  SCSI  disks?” 

With no double-ending, i.e., with each enclosure only connected to one PC, the availability
goes down to about 99.95%, or 4 hours of outage per year. This outage is about five times that of the
current system. The MTBF goes down to about 14 years, or 125,000 hours, which is one
quarter of that of the current system.

Figures 6.23 and 6.24 show how double-ending affects availability. The horizontal axes
are the average time between SCSI drive failures and IDE drive failures, respectively.


Figure 6.23: SCSI drive failure rates and double-ending
(Horizontal axis: average time between SCSI disk failures, in years. Dotted line indicates observed value)


We can see from Figure 6.23 that SCSI disk failure rates affect configurations both with and
without double-ending. When the disk failure rates are very high, the two graphs coincide; when the
main reason for outages is data disk failures, adding connectivity to the system by double-ending
doesn’t improve the overall availability. When the disk failure rates are lower, the overall system
availability will increase, but without double-ending, the improvement is limited, as the line without
double-ending levels off at about 2 hours per year, when other factors start dominating the availability.

Figure 6.24 provides a partial explanation for that limitation: the IDE drive failures are almost
masked out when the SCSI drives are double-ended, but without double-ending, they greatly affect the
system reliability, as a single PC failure causes a few dozen disks to be disconnected temporarily.

Figure 6.25 shows how decreasing the number of disks per striped set will affect availability.
Even without any striping, the outage per year can only be reduced to about 3 hours without
double-ending. The reason for this limitation is again that SCSI disk failures are no longer the
main cause of unreliability in the case without double-ending.

It is not really possible to create a system without double-ending that has reliability
characteristics similar to our current system, given our failure rates, just by adjusting the number of components.
For instance, adding more PCs will reduce the number of disconnected disks when one of the PCs
goes down, but increases the chance that one of the PCs will be down at any given moment. The only
ways to significantly reduce the failures without double-ending are to reduce the repair interval or
increase the number of mirrors of data.


Figure 6.24: IDE drive failure rates and double-ending
(Dotted line indicates observed value)


Figure 6.25: Size of striped sets and double-ending
(Dotted line indicates default)

Figure 6.26 shows the effect of repair interval on failures on systems with or without
double-ending. A repair interval of 3 days will reduce the outage hours per year to about 0.8 on
a system without double-ending, which is comparable to our double-ended system with a repair
interval of 7 days. As the two lines are roughly parallel, for the Tertiary Disk system it appears that
double-ending allows you to double the repair interval and still obtain the same availability.


Figure 6.26: Repair interval and double-ending
(Horizontal axis: repair interval, in days. Dotted line indicates default)


Figure 6.27 shows how increasing the number of mirrors affects the downtime. With three
copies of data, a system without double-ending can be made more available than a double-ended
system with the normal two-way mirror.

6.3.9  Operating  System  Crashes 

“Will  operating  system  crashes  affect  availability?  How  much  will  reboot  time  after  a  crash  affect 
availability?” 

Since we haven’t experienced any machine crashes caused by operating system bugs during
our project, the generator doesn’t generate OS-related reboot events by default. However, as
mentioned before, it can be made to generate those events by command-line switches.

Figures  6.28  and  6.29  show  the  effect  of  operating  system  crashes  with  the  original  reboot 
time  and  fast  reboot  improvements.  As  mentioned  in  Section  3.3.2,  the  original  reboot  time  was  22 
minutes;  with  the  fast  reboot  modifications,  it  was  reduced  to  1  minute. 

As can be seen from Figure 6.28, even with very frequent crashes, there was virtually no
effect on outage hours per year. Operating system crashes have very little effect on
outage hours because the recovery time from such crashes is very small compared to hardware
failure repairs; thus, even if the system suffers an outage, it will recover very quickly, resulting in
little extra outage time. The difference between systems with and without fast reboot is within the
error range.


Figure 6.27: Mirroring and double-ending
(Horizontal axis: number of mirrored copies)


Figure 6.28: Operating system crashes and availability
(Horizontal axis: average time between operating system crashes, in days)


Figure 6.29: Operating system crashes and MTBF

On the other hand, Figure 6.29 shows that the MTBF of the system will be affected by
frequent crashes. With operating systems that crash daily, the MTBF will be down to about 1 year
with normal reboots, and about 8 years with the fast reboot improvements.

6.3.10  Number  of  PCs  per  Disk 

“What  is  the  effect  of  decreasing  or  increasing  the  number  of  disks  per  PC?” 

Intuition seems to suggest that, if you add more PCs to the system, the availability will
improve since there will be fewer things to be disconnected when a PC goes down. However, this is
actually not the case: with the increase in the number of PCs come more PC failures.

Figure 6.30 shows how little adding more PCs to the system helps availability. Furthermore,
even with only two PCs, connected to all the disks on the system via double-ending, the
system availability suffers only a very small hit.

6.3.11  Total  Size 

“How will a system 10 times as big as the current one behave? How can I make it as reliable as the
current one?”

If we were to build a system that has more components, what will its characteristics be?
For instance, if we were to fill up the entire machine room with ten times as many disks and PCs,
how will it behave?


Figure 6.30: Changing number of PCs

Figure  6.31  shows  how  availability  is  affected  by  the  total  system  size.  The  simulation 
indicates  that  degradation  in  availability  is  proportional  to  size.  For  instance,  the  outage  average  is 
7.67  hours  per  year,  about  ten  times  that  of  the  current  system,  for  a  tenfold  system. 

What  will  it  take  to  bring  such  a  system  under  control?  From  previous  sections,  we  know 
that  increasing  the  number  of  mirrors,  decreasing  the  stripe  set  size,  and  reducing  the  repair  interval 
are  the  only  options. 

I  found  that,  with  3-way  mirroring,  the  system’s  outage  will  be  down  to  less  than  0.1  hours 
per  year.  Note  that  replicating  data  in  more  places  will  reduce  the  effective  capacity  of  a  set  of  disks, 
so  the  cost  penalty  is  rather  severe.  For  instance,  a  3-way  mirror  will  cost  50%  more  than  a  2-way 
mirror  to  construct  given  the  same  net  storage  system  size. 
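
The 50% figure is simply the ratio of copies. A minimal sketch, assuming disk cost is proportional to raw capacity and ignoring enclosures, PCs, and other fixed costs (the function name is hypothetical):

```python
from fractions import Fraction


def relative_disk_cost(copies: int, baseline_copies: int = 2) -> Fraction:
    """Raw disk needed by an n-way mirror relative to the baseline, for equal net capacity."""
    if copies < 1 or baseline_copies < 1:
        raise ValueError("mirror counts must be at least 1")
    return Fraction(copies, baseline_copies)


for copies in (2, 3, 4):
    extra_percent = float((relative_disk_cost(copies) - 1) * 100)
    print(f"{copies}-way mirror: {extra_percent:.0f}% more raw disk than a 2-way mirror")
```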

Figure  6.32  shows  the  effect  of  reducing  the  repair  interval  on  a  system  ten  times  the  size 
of  our  current  system.  To  bring  the  outage  down  to  a  comparable  level,  the  repair  interval  has  to  be 
2  days — the  operator  comes  in  three  times  as  often. 

However, with an arrangement similar to the one proposed in Section 6.3.4, we can reduce
the operator cost without compromising reliability. For instance, with one-day intervals on a tenfold
system, 38% of the days have one or more failures. If the operator carries a pager and only comes in
when the system had a failure in the previous 24 hours, this person only needs to actually make a
visit 2.7 days per week.
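
These two numbers are consistent with a simple scaling argument. The sketch below assumes Poisson failure arrivals, a failure rate that grows linearly with system size, and the 21 days between failures reported for the current system in Section 6.3.3; the function name is hypothetical.

```python
import math


def expected_visits_per_week(size_factor: float,
                             base_days_between_failures: float = 21.0) -> float:
    """Expected days per week with at least one failure, for a scaled-up system."""
    failures_per_day = size_factor / base_days_between_failures
    share_of_days_with_failure = 1.0 - math.exp(-failures_per_day)
    return 7.0 * share_of_days_with_failure


print(f"current system: {expected_visits_per_week(1):.1f} visits per week")
print(f"tenfold system: {expected_visits_per_week(10):.1f} visits per week")
```

For the tenfold system this gives about 38% of days with a failure, or roughly 2.7 visits per week.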

Figure  6.33  shows  how  reducing  the  number  of  disks  per  striped  set  will  help.  By  not 
using  any  striping,  the  outage  can  be  reduced  to  about  twice  that  of  the  current  system.  However, 
the  burden  of  maintaining  5,120  one-disk  filesystems  would  be  great. 


Figure 6.33: Reducing the number of disks per striped set
(Dotted line indicates default)


6.3.12  99.999%  or  99.9999%  Availability 

“What will it take to build a system with 99.999% availability given the current component failure
rates? What about 99.9999%?”

Some  commercial  systems  boast  “five  nines”,  i.e.,  99.999%  availability.  What  will  it  take 
to  build  such  a  system  using  only  commodity  components?  What  about  “six  nines”,  i.e.,  99.9999% 
availability? 

99.999% availability only allows 0.09 hours, or 5 minutes, of downtime per year, and
99.9999% allows 0.009 hours, or 30 seconds. We need to reduce our downtime by a factor of 10
or 100 to obtain this. From previous discussions, it is clear that the only options available are to
increase the number of mirrored copies of data or decrease the repair interval.

With  current  failure  rates  and  configurations,  a  3-way  mirror  will  accomplish  the  task 
quite  easily.  In  fact,  with  a  3-way  mirror,  my  simulator  detected  only  66  hours  of  outage  in  100,000 
years,  or  2.4  seconds  of  downtime  per  year,  which  translates  to  99.999992%  availability.  A  3-way 
mirror  incurs  a  50%  capacity  penalty. 

The  other  option,  reducing  the  repair  interval,  can  get  us  to  99.999%  with  a  two-day  repair 
interval.  As  for  99.9999%,  we  can  obtain  that  goal  with  an  18-hour  repair  interval  as  mentioned  in 
Section  6.3.4. 
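
The conversion from simulated outage totals to the availability figures quoted in this section can be sketched as follows. The function name is hypothetical, and a year is taken as 8,760 hours.

```python
HOURS_PER_YEAR = 24 * 365  # leap years are ignored


def availability_from_totals(outage_hours: float, simulated_years: float) -> float:
    """Fraction of simulated time during which the system was available."""
    return 1.0 - outage_hours / (simulated_years * HOURS_PER_YEAR)


# 3-way mirror result quoted above: 66 hours of outage in 100,000 simulated years.
availability = availability_from_totals(outage_hours=66, simulated_years=100_000)
seconds_per_year = 66 * 3600 / 100_000
print(f"{seconds_per_year:.1f} seconds of downtime per year")
print(f"availability {availability:.6%}")
```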


6.4  Summary 

I have implemented an event-driven simulator which constructs a DAG to represent the
system and simulates its behavior efficiently. Using this simulator, and a random event generator
based on the data I collected in Chapter 5, I evaluated the availability of our system and explored
the design space.

These  are  the  key  discoveries  presented  in  this  chapter. 

• Our system appears to work very well under the observed failure rates. To put things in
perspective, we had three campus-wide power failures or network outages in the past 12 months,
resulting in more than twenty hours of disconnection from the outside world. Compared to
these, the 1 hour per year outage rate of our system is minuscule.

•  Of  all  component  reliabilities,  the  SCSI  disk  failure  rate  has  the  greatest  effect  on  system 
reliability,  with  system  reliability  being  a  near  linear  function  of  SCSI  disk  reliability.  This 
result  is  positive,  as  we  want  the  system  availability  to  be  a  function  of  the  storage  itself,  and 
not  some  less  visible  component. 

• SCSI disk reliability dominates because double-ending effectively masks failures of everything
else by providing extra connectivity. Without double-ending, the system’s reliability
suffers and the system is much more sensitive to component failure rates.

•  Without  mirroring  data,  the  system  will  be  very  unreliable.  On  the  other  hand,  adding  the 
third  mirrored  copy  of  all  data  will  make  it  virtually  unbreakable. 

• Adding more PCs, and thus more connectivity, to the system does not affect the overall
system reliability much. The reason is that, while more PCs will cause fewer disks to be
affected if one of them is disconnected, the extra PCs will increase the chance that one or
more of them is down at any given moment.

•  It  is  possible  to  construct  a  system  ten  times  the  size  of  ours  using  the  same  equipment  and 
with  similar  reliability  measures  by  increasing  the  number  of  mirrors  of  data,  reducing  the 
numbers  of  disks  per  striped  set,  or  reducing  the  repair  interval. 

• It is possible to increase the system availability to 99.999% by increasing the number of
mirrors of data or reducing the repair interval. To improve it to 99.9999%, the options are to
increase the number of mirrors of data or to further reduce the repair interval. However, as
mentioned in Section 6.3.4, it could be possible to reduce the repair interval without increasing
the operator cost too much.


Chapter  7 

Conclusion 

7.1  Results 

In  this  dissertation,  I  presented  the  following: 

• The design of a large storage system built using commodity components only

• An application and the modifications to the operating system we made to make it self-maintaining
and run almost unattended

• An analysis of various methods to reduce the system administration cost of large storage systems

• A catalog of component failures we observed over the two years the prototype has been in
operation

• An evaluation of the reliability of our design by simulation and exploration of the design
space

7.2  Contributions 

In summary, these are the contributions of this research:

• Demonstration of the construction of a self-maintaining Internet service, both via experiment
on our constructed system and via simulation, with much lower system administration cost and
much lower hardware costs

• Showing that mirroring level, disk reliability, and repair time are the three key parameters in
constructing highly available large-scale storage servers with enough internal connectivity

• Observations on how to architect a system so that it can be highly available yet largely
insensitive to the reliability of hidden infrastructure components

• Use of a DAG (directed acyclic graph) to represent the system to simplify simulation of availability

• Recording failures and repairs of a large system over two years to provide parameters to an
availability simulator
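
To illustrate the DAG idea in the list above, the following Python sketch models a system as a directed acyclic graph in which each node is a component and each edge records a component through which it can be reached. A component counts as reachable if it is working and at least one of its parents is reachable. The component names, the tiny topology, and the functions `reachable` and `available_disks` are hypothetical, and are not taken from the simulator described in this dissertation.

```python
# A minimal sketch, assuming a simple "any working parent" rule.
# Topology and names are hypothetical, not the dissertation's simulator.

# PARENTS[node] lists the components through which node can be reached.
# A double-ended SCSI string has two parent PCs.
PARENTS = {
    "net": [],
    "pc0": ["net"],
    "pc1": ["net"],
    "string0": ["pc0", "pc1"],   # double-ended: reachable via either PC
    "disk0": ["string0"],
    "disk1": ["string0"],
}


def reachable(node, failed, parents=PARENTS, _memo=None):
    """Return True if node is working and connected to a root through
    at least one chain of working components."""
    if _memo is None:
        _memo = {}
    if node in _memo:
        return _memo[node]
    if node in failed:
        result = False
    elif not parents[node]:          # a root with no dependencies
        result = True
    else:
        result = any(reachable(p, failed, parents, _memo)
                     for p in parents[node])
    _memo[node] = result
    return result


def available_disks(failed, parents=PARENTS):
    """List the disks that are still reachable given a set of failures."""
    memo = {}
    return sorted(n for n in parents
                  if n.startswith("disk")
                  and reachable(n, failed, parents, memo))


if __name__ == "__main__":
    print(available_disks(set()))             # no failures
    print(available_disks({"pc0"}))           # masked by double-ending
    print(available_disks({"pc0", "pc1"}))    # both ends down
```

In this toy topology, losing one PC leaves both disks reachable through the other end of the string, while losing both PCs disconnects every disk.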




7.3  What  We  Would  Have  Done  Differently 

There were several design choices we made when we drafted the initial design of our
system. Some of them were forced by circumstances, as other alternatives were not readily available
at that time, while some were our conscious decisions. This section summarizes those choices and
explores other possibilities to evaluate whether we would build the system differently if we were to
design it again today.

7.3.1  Disk  Interface 

At the beginning of the project, we had only one choice for the disk interface: single-ended
SCSI. Newer and better standards, such as differential SCSI, SSA, and FC-AL, had been released,
but components supporting those standards had not yet reached "commodity" level.

In three years, the landscape has changed significantly. SSA merged into FC-AL, but the
joined standard has not yet reached commodity status. IDE, which was once the low end of disk
drives, has made leaps and bounds in terms of reliability and especially performance. However, IDE
still has a two-disk-per-cable restriction. Unless we are going to a design with significantly fewer
disk drives per PC, IDE is not yet a viable solution.

As for SCSI, LVD ("low-voltage differential") disks have become widespread in the last
couple of years. LVD combines the benefits of differential SCSI and a lower signal voltage to
achieve an even higher bus speed of up to 80MHz, which can transfer up to 160MB/s on a 16-bit
cable, with a bus that is much more electrically stable than the original single-ended SCSI bus
running at 20MHz.
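
The bandwidth figure above follows from multiplying the transfer rate by the bus width. The Python sketch below reproduces that arithmetic. The function name is hypothetical, and it treats the quoted MHz figures as millions of transfers per second, which is an assumption about how the numbers are meant. The 16-bit width used for the 20MHz case is likewise an assumption made only for comparison.

```python
def peak_bandwidth_mb_per_s(transfer_rate_mhz, bus_width_bits):
    """Peak bus bandwidth in MB/s: transfers per second times bytes
    per transfer (1 MB taken as 10**6 bytes)."""
    return transfer_rate_mhz * bus_width_bits / 8


if __name__ == "__main__":
    print(peak_bandwidth_mb_per_s(80, 16))  # LVD figures quoted above
    print(peak_bandwidth_mb_per_s(20, 16))  # 20MHz bus, assumed 16-bit width
```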

Given that we had a fair number of problems with SCSI enclosure and cable integrity,
LVD-SCSI equipment, the current standard on the market, would be a better choice today.

7.3.2  Disk  Purchases 

During the course of the project, our department suffered two massive failures on
departmental fileservers despite them having RAID-5 protection. One of them was caused by disk
firmware bugs. Apparently the particular revision of firmware on the disks had a problem that
manifests itself during a RAID-5 reconstruction. Thus, when one of the disks failed and the
automatic recovery process started, the entire array was lost.

Although we have not had such catastrophic failures, we had our share of disk firmware
problems as mentioned in Section 5.3.3. I have also personally experienced disk problems related
to firmware on SCSI disks I own; I had to upgrade the firmware on some of them to make the disks
work reliably.

The biggest issue with disk firmware problems is that they often will not show up immediately;
sometimes they will not appear until the most inopportune moment, such as RAID-5 reconstruction
on our departmental fileserver. Unfortunately, those are often also times of emergency, when the
system's vulnerability is already high.

In order to safeguard against such problems, with a system like ours, it might be a good
idea to use disks of different brands. For instance, if we purchase disks from two manufacturers,
use disks from the same manufacturer to construct each striped set, and make sure the two mirrored
copies of data reside on disks from different manufacturers, we will be able to survive a
catastrophic failure caused by disk firmware problems.
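
The placement rule described above can be sketched as a small check. In the Python below, each striped set is built from disks of a single manufacturer, and a mirrored pair is acceptable only if its two striped sets come from different manufacturers. The data layout, the manufacturer labels "A" and "B", and the function names are hypothetical illustrations and are not part of our system.

```python
# A minimal sketch under stated assumptions; not the Tertiary Disk code.

def stripe_manufacturer(stripe):
    """Return the single manufacturer of a striped set, or raise
    ValueError if the set mixes manufacturers."""
    makers = {disk["maker"] for disk in stripe}
    if len(makers) != 1:
        raise ValueError("striped set mixes manufacturers: %s" % sorted(makers))
    return makers.pop()


def mirror_survives_firmware_bug(stripe_a, stripe_b):
    """True if the two mirrored copies sit on different manufacturers,
    so a firmware bug confined to one brand cannot take out both."""
    return stripe_manufacturer(stripe_a) != stripe_manufacturer(stripe_b)


def build_mirrors(disks, stripe_width):
    """Group disks by manufacturer into striped sets of stripe_width,
    then pair sets from the two manufacturers into mirrors."""
    by_maker = {}
    for disk in disks:
        by_maker.setdefault(disk["maker"], []).append(disk)
    if len(by_maker) != 2:
        raise ValueError("this sketch assumes exactly two manufacturers")
    sets = []
    for maker in sorted(by_maker):
        group = by_maker[maker]
        sets.append([group[i:i + stripe_width]
                     for i in range(0, len(group) - stripe_width + 1,
                                    stripe_width)])
    return list(zip(sets[0], sets[1]))


if __name__ == "__main__":
    disks = [{"id": i, "maker": "A" if i % 2 == 0 else "B"} for i in range(8)]
    for left, right in build_mirrors(disks, stripe_width=2):
        print([d["id"] for d in left], [d["id"] for d in right],
              mirror_survives_firmware_bug(left, right))
```

The check is deliberately strict: a striped set that mixes manufacturers is rejected outright, because a firmware bug in either brand would then affect both copies of the data.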




7.3.3  Disk  Enclosures 

By far the hardest part of setting up the system initially was dealing with disk enclosures.
There were numerous problems with them, including SCSI backplane integrity problems, missing
cables, the wrong type of mounting rails, and so on.

After we sent almost half of them back to the manufacturer for repair, the enclosures
have been working reasonably well, with only three additional failures in two years. However, my
experience shows that moving the enclosures will sometimes cause them to stop working, and we
have been unable to move the Tertiary Disk prototype to another location for that reason.

However, I am not sure if we could have done anything differently with regard to disk
enclosures. We tested several of the enclosures before ordering them in full; in fact, the ones
we initially tested are still working fine to this day on stampede, our file server, without any
problems related to SCSI bus integrity. The manufacturer apparently changed something in their
design between those and the ones we got in large quantity, and that caused the problem. As disk
enclosures are relatively low-volume products, these changes of design are fairly frequent and
there is little a purchaser can do about it, except checking references of the manufacturer.

On another note, as mentioned in Section 1.3.3, many newer disk enclosures have environmental
monitoring systems called SCSI Enclosure Services (SES), which can be accessed
through the SCSI bus. To build a self-maintaining storage system, SES will be very useful.

7.3.4  PC  Cases 

We used ordinary computer cases for our PCs. In fact, these are "ordinary" PCs in the
truest sense of the word: they were part of a large donation to the department and most of the
others went into graduate students' offices. There are two problems with them.

The first problem is that they are not very space-efficient. For our system, we did not need
many PCs so it wasn't a big problem, but if the PC to disk ratio is higher, the space inefficiency
will be a serious issue.

The second problem is that they are not very easy to handle, as these PCs are designed to
be used in ordinary offices, and not to be installed in machine-room racks. The operator's job was
made much harder when there was a repair that involved changing a component inside one of the
PCs.

Both problems can be solved by using rack-mount PC cases. Although these cases are
more expensive than ordinary PC cases (rack-mount PC cases cost about $250 while ordinary
cases cost $50 to $100), the reduction in space and operator burden should more than make up for
the cost increase.


7.4  Future  Work 

There are some issues that have not been fully resolved in our system. This section will
list two of those and suggest future directions for potentially interesting research.


7.4.1  Completion 


Some of the improvements listed in this dissertation, such as bypassing the BIOS during
booting and a PC automatically taking over for its double-ended companion, are not fully
implemented. Some of these should be completed before the system can be considered truly
self-maintaining.

7.4.2  Design  Parameters 

The experimentation with different design parameters is truly hypothetical at this point.
For instance, we have argued that one part-time operator can maintain a 5,000-disk system without
much difficulty. However, that analysis was based on extrapolation only, assuming the difficulty
of tasks will increase proportionally with the system size. Such assumptions are not always true,
as there are other factors, such as the psychological effects the system size has on the operator
and the likelihood of the operator getting confused about which part of the system needs repair,
that cannot be measured without extensive research in those areas.






